Group Lasso with Overlaps: 
the Latent Group Lasso approach 



Guillaume Obozinski* 

Sierra team - INRIA 
Ecole Normale Superieure 
(INRIA/ENS/CNRS UMR 8548) 
Paris, France 



GUILLAUME.OBOZINSKI@ENS.FR



Laurent Jacob* 



LAURENT@STAT.BERKELEY.EDU 



Department of Statistics 
University of California 
Berkeley CA 94720, USA 



Jean-Philippe Vert 



Jean-Philippe.Vert@mines.org 



Centre for Computational Biology 

Mines ParisTech 

Fontainebleau, F- 77300, France 

INSERM U900 

Institut Curie 

Paris, F-75005, France 



We study a norm for structured sparsity which leads to sparse linear predictors whose 
supports are unions of predefined overlapping groups of variables. We call the obtained 
formulation latent group Lasso, since it is based on applying the usual group Lasso penalty 
on a set of latent variables. A detailed analysis of the norm and its properties is presented 
and we characterize conditions under which the set of groups associated with latent variables
is correctly identified. We motivate and discuss the delicate choice of weights associated
to each group, and illustrate this approach on simulated data and on the problem of breast 
cancer prognosis from gene expression data. 

Keywords: group Lasso, sparsity, graph, support recovery, block regularization, feature 
selection 

1. Introduction 

Sparsity has triggered much research in statistics, machine learning and signal process- 
ing recently. Sparse models are attractive in many application domains because they lend 
themselves particularly well to interpretation and data compression. Moreover, from a sta- 
tistical viewpoint, betting on sparsity is a way to reduce the complexity of inference tasks 
in large dimensions with limited amounts of observations. While sparse models have tradi- 
tionally been estimated with greedy feature selection approaches, more recent formulations 
as optimization problems involving a non-differentiable convex penalty have proven very 
successful both theoretically and practically. The canonical example is the penalization of 
a least-squares criterion by the $\ell_1$ norm of the estimator, known as the Lasso in statistics
(Tibshirani, 1996) or basis pursuit in signal processing (Chen et al., 1998). Under appropriate
assumptions, the Lasso can be shown to recover the exact support of a sparse model from
data generated by this model if the covariates are not too correlated (Wainwright, 2009; 
Zhao and Yu, 2006). It is consistent even in high dimensions, with fast rates of convergence 
(Bickel et al., 2009; Lounici, 2008). We refer the reader to van de Geer (2010) for a detailed 
review. 

While the $\ell_1$ norm penalty leads to sparse models, it does not encode any prior
information about the structure of the sets of covariates that one may wish to see selected
jointly, such as predefined groups of covariates. An extension of the Lasso for the selection
of variables in groups was proposed under the name group Lasso by Yuan and Lin (2006),
who considered the case where the groups form a partition of the set of variables. The
group Lasso penalty, also called $\ell_1/\ell_2$ penalty, is defined as the sum (i.e., $\ell_1$ norm) of the
$\ell_2$ norms of the restrictions of the parameter vector of the model to the different groups of
covariates. The work of several authors shows that when the support can be encoded well 
by the groups defining the norm, support recovery and estimation are improved (Huang 
and Zhang, 2010; Kolar et al., 2011; Lounici et al., 2010, 2009; Negahban and Wainwright, 
2011; Obozinski et al., 2010). 

Subsequently, the notion of structured sparsity emerged as a natural generalization of the
selection in groups, where the support of the model one wishes to recover is no longer
required to be just sparse but also to display a certain structure. One of the first natural
approaches to structured sparsity has been to consider extensions of the $\ell_1/\ell_2$ penalty to
situations in which the groups considered overlap, so that the possible support patterns
exhibit some structure (Bach, 2009; Zhao et al., 2009). Jenatton et al. (2011) formalized this
approach and proposed an $\ell_1/\ell_2$ norm construction for families of allowed supports stable
by intersection. Other approaches to structured sparsity are quite diverse: Bayesian or
non-convex approaches that directly exploit the recursive structure of some sparsity patterns
such as trees (Baraniuk et al., 2010; He and Carin, 2009), greedy approaches based on 
block-coding (Huang et al., 2009), relaxation of submodular penalties (Bach, 2010), generic 
variational formulations (Micchelli et al., 2011). 

While Jenatton et al. (2011) proposed a norm inducing supports that arise as intersections
of a sub-collection of groups defining the norm, we consider in this work norms
which, albeit also defined by a collection of overlapping groups, induce supports that are
rather unions of a sub-collection of the groups encoding prior information. The main idea
is that instead of directly applying the $\ell_1/\ell_2$ norm to a vector, we apply it to a set of latent
variables each supported by one of the groups, which are combined linearly to form the 
estimated parameter vector. In the regression case, we therefore call our approach latent 
group Lasso. 

The corresponding decomposition of a parameter vector into latent variables calls for the 
notion of group-support, which we introduce and which corresponds to the set of non-zero
latent variables. In the context of a learning problem regularized by the norm we propose, 
we study the problem of group-support recovery, a notion stronger than the classical support 
recovery. Group-support recovery typically implies support recovery (although not always) 
if the support of a parameter vector is exactly a union of groups. We provide sufficient 
conditions for consistent group-support recovery. 

In the definition of our norm, a weight is associated with each group. These weights play 
a much more important role in the case of overlapping groups than in the case of disjoint 
groups, since in the former case they determine the set of recoverable supports and the
complexity of the class of possible models. We discuss the delicate question of the choice of 
these weights. 

While the norm we consider is quite general and has potentially many applications, we 
illustrate its potential on the particular problem of learning sparse predictive models for 
cancer prognosis from high-dimensional gene expression data. The problem of identifying a 
predictive molecular signature made of a small set of genes is often ill-posed and so noisy 
that exact variable selection may be elusive. We propose that, instead, selecting genes 
in groups that are involved in the same biological process or connected in a functional 
or interaction network could be performed more reliably, and potentially lead to better 
predictive models. We empirically explore this application, after extensive experiments on 
simulated data illustrating some of the properties of our norm. 

To summarize, the main contributions of this paper, which rephrases and extends a 
preliminary version published in Jacob et al. (2009), are the following: 

• We define the latent group Lasso penalty to infer sparse models with unions of predefined
groups as supports, and analyze in detail some of its mathematical properties.

• We introduce the notion of group-support and group-support recovery results. Using
correspondence theory, we show that, under appropriate conditions and in a classical
asymptotic setting, estimators for the linear regression regularized with $\Omega^{\mathcal{G}}_{\cup}$ are
consistent for the estimation of a sufficiently sparse group-support.

• We discuss at length the choice of weights associated with each group, which play a
crucial role in the presence of overlapping groups of different sizes. 

• We provide extended experimental results both on simulated data — addressing 
support-recovery, estimation error and role of weights — and on breast cancer data, 
using biological pathways and gene networks as prior information to construct latent
group Lasso formulations. 

The rest of the paper is structured as follows. We first introduce the latent group Lasso 
penalty and position it in the context of related work in Section 3. In Section 4 we show 
that it is a norm and provide several characterizations and variational formulations; we also 
show that regularizing with this norm is equivalent to covariate duplication (Section 4.6) 
and derive a corresponding multiple kernel learning formulation (Section 4.7). We briefly 
discuss algorithms in Section 4.8. In Section 5, we introduce the notion of group-support and 
consider in Section 6 a few toy examples to illustrate the concepts and properties discussed 
so far. We study group support-consistency in Section 7. The difficult question of the choice 
of the weighting scheme is discussed in Section 8. Section 9 presents the latent graph Lasso, 
a variant of the latent group Lasso when covariates are organized into a graph. Finally, 
in Section 10, we present several experiments: first, on artificial data to illustrate the gain 
in support recovery and estimation over the classical Lasso, as well as the influence of the 
choice of the weights; second, on the real problem of breast cancer prognosis from gene 
expression data. 

2. Notations

In this section we introduce notations that will be used throughout the article. For any
vector $w \in \mathbb{R}^p$ and any $q \geq 1$, $\|w\|_q = \left(\sum_{i=1}^p |w_i|^q\right)^{1/q}$ denotes the $\ell_q$ norm of $w$. We
simply use the notation $\|w\| = \|w\|_2$ for the Euclidean norm. $\mathrm{supp}(w) \subset [1,p]$ denotes the
support of $w$, i.e., the set of covariates $i \in [1,p]$ such that $w_i \neq 0$. A group of covariates is a
subset $g \subset [1,p]$. The set of all possible groups is therefore $\mathcal{P}([1,p])$, the power set of $[1,p]$.
For any group $g$, $g^c = [1,p] \setminus g$ denotes the complement of $g$ in $[1,p]$, i.e., the covariates
which are not in $g$. $\Pi_g : \mathbb{R}^p \to \mathbb{R}^p$ denotes the projection onto $\{w : w_i = 0 \text{ for } i \in g^c\}$,
i.e., $\Pi_g w$ is the vector whose entries are the same as those of $w$ for the covariates in $g$, and are
$0$ for the other covariates. We will usually use the notation $w_g = \Pi_g w$. We say that two
groups overlap if they have at least one covariate in common.

Throughout the article, $\mathcal{G} \subset \mathcal{P}([1,p])$ denotes a set of groups, usually fixed in advance
for each application, and we denote by $m = |\mathcal{G}|$ the number of groups in $\mathcal{G}$. We require that
all covariates belong to at least one group, i.e.,

$$\bigcup_{g \in \mathcal{G}} g = [1,p].$$

We denote by $\mathcal{V}_{\mathcal{G}} \subset \mathbb{R}^{p \times m}$ the set of $m$-tuples of vectors $v = (v^g)_{g \in \mathcal{G}}$, where each $v^g$ is a vector
in $\mathbb{R}^p$, that satisfy $\mathrm{supp}(v^g) \subset g$ for each $g \in \mathcal{G}$.

For any differentiable function $f : \mathbb{R}^p \to \mathbb{R}$, we denote by $\nabla f(w) \in \mathbb{R}^p$ the gradient of $f$
at $w \in \mathbb{R}^p$ and by $\nabla_g f(w) \in \mathbb{R}^{g}$ the partial gradient of $f$ with respect to the covariates in
$g$.

In optimization problems throughout the paper we will use the convention that $\frac{0}{0} = 0$,
so that the $\overline{\mathbb{R}}$-valued function $(x, y) \mapsto \frac{x^2}{y}$ is well defined and jointly convex on $\mathbb{R} \times \mathbb{R}_+$.



3. Group Lasso with overlapping groups 

Given a set of groups $\mathcal{G}$ which form a partition of $[1,p]$, the group Lasso penalty (Yuan and
Lin, 2006) is a norm over $\mathbb{R}^p$ defined as:

$$\forall w \in \mathbb{R}^p, \quad \|w\|_{\ell_1/\ell_2} = \sum_{g \in \mathcal{G}} d_g \|w_g\|, \qquad (1)$$

where $(d_g)_{g \in \mathcal{G}}$ are positive weights. This is a norm whose balls have singularities when some
$w_g$ are equal to zero. Minimizing a smooth convex loss functional $L : \mathbb{R}^p \to \mathbb{R}$ over such a
ball, or equivalently solving the following optimization problem for some $\lambda > 0$:

$$\min_{w \in \mathbb{R}^p} L(w) + \lambda \sum_{g \in \mathcal{G}} d_g \|w_g\|, \qquad (2)$$

often leads to a solution that lies on a singularity, i.e., to a vector $w$ such that $w_g = 0$ for
some of the groups $g$ in $\mathcal{G}$. Equivalently, the solution is sparse at the group level, in the sense
that coefficients within a group are usually zero or nonzero together. The hyperparameter
$\lambda > 0$ in (2) is used to adjust the tradeoff between minimizing the loss and finding a solution
which is sparse at the group level.
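
To make the group-level sparsity concrete, the following minimal Python sketch evaluates penalty (1) on a partition and applies block soft-thresholding, the standard proximal operator of a sum of weighted Euclidean norms over disjoint groups. The proximal operator is not discussed in the text and is brought in here only as an illustration; the function names and the toy data are hypothetical.

```python
import math

def group_lasso_penalty(w, groups, weights):
    """Penalty (1): sum over groups of d_g * ||w_g||_2 (groups form a partition)."""
    return sum(d * math.sqrt(sum(w[i] ** 2 for i in g))
               for g, d in zip(groups, weights))

def block_soft_threshold(w, groups, weights, lam):
    """Proximal operator of lam * penalty (1) for disjoint groups.

    Each block w_g is scaled by max(0, 1 - lam * d_g / ||w_g||), so a block
    with a small norm is set to zero as a whole.
    """
    out = list(w)
    for g, d in zip(groups, weights):
        norm = math.sqrt(sum(w[i] ** 2 for i in g))
        scale = max(0.0, 1.0 - lam * d / norm) if norm > 0 else 0.0
        for i in g:
            out[i] = scale * w[i]
    return out

# Toy example (0-indexed covariates, unit weights): two disjoint groups.
groups = [(0, 1), (2, 3)]
weights = [1.0, 1.0]
w = [3.0, 4.0, 0.3, 0.4]

penalty = group_lasso_penalty(w, groups, weights)         # about 5 + 0.5
shrunk = block_soft_threshold(w, groups, weights, lam=1.0)
print(penalty, shrunk)
```

In this toy example the second block has norm 0.5, below the threshold, and is zeroed entirely, while the first block is only rescaled: coefficients within a group are zero or nonzero together.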

Figure 1: (a) Left: Effect of penalty (1) on the support: removing any group containing a
variable removes the variable from the support. When variables in groups 1 and
3 are shrunk to zero, the support of the solution consists of the variables of the
second group which are neither in the first, nor in the third. (b) Right: Latent
decomposition of $w$ over $(v^g)_{g \in \mathcal{G}}$: applying the $\ell_1/\ell_2$ penalty to the decomposition
instead of applying it to the $w_g$ removes only the variables which do not belong
to any selected group. The support of the solution if latent vectors $v^1$ and $v^3$ are
shrunk to zero will be all variables in the second group.



When Q is not a partition anymore and some of its groups overlap, the penalty (1) is 
still a norm, because we assume that all covariates belong to at least one group. However, 
while the Lasso is sometimes loosely presented as selecting covariates and the group Lasso 
as selecting groups of covariates, the group Lasso estimator (2) does not necessarily select 
groups in that case. The reason is that the precise effect of non-differentiable penalties is 
to set covariates, or groups of covariates, to zero, and not to select them. When there is 
no overlap between groups, setting groups to zero leaves the other groups entirely nonzero,
which can give the impression that group Lasso is generally appropriate to select a small 
number of groups. When the groups overlap, however, setting one group to zero shrinks its 
covariates to zero even if they belong to other groups, in which case these other groups will 
not be entirely selected. This is illustrated in Figure 1(a) with three overlapping groups of 
covariates. If the penalty leads to an estimate in which the norm of the first and of the 
third group are zero, what remains nonzero is not the second group, but the covariates of 
the second group which are neither in the first nor in the third one. More formally, the 
overlapping case has been extensively studied by Jenatton et al. (2009), who showed that in 
the case where L(w) is an empirical risk and under very general assumptions on the data, 
the support of a solution $w$ of (2) almost surely satisfies

$$\mathrm{supp}(w) = \Big( \bigcup_{g \in \mathcal{G}_0} g \Big)^c$$

for some $\mathcal{G}_0 \subset \mathcal{G}$, i.e., the support is almost surely the complement of a union of groups.
Equivalently, the support is an intersection of the complements of some of the groups considered.

In this work, we are interested in penalties which induce a different effect: we want the
estimator to select entire groups of covariates, or more precisely we want the support of the
solution $w$ to be a union of groups. For that purpose, we introduce a set of latent variables
$v = (v^g)_{g \in \mathcal{G}}$ such that $v^g \in \mathbb{R}^p$ and $\mathrm{supp}(v^g) \subset g$ for each group $g \in \mathcal{G}$, and propose to
solve the following problem instead of (2):

$$\min_{w \in \mathbb{R}^p,\ v \in \mathcal{V}_{\mathcal{G}}} L(w) + \lambda \sum_{g \in \mathcal{G}} d_g \|v^g\| \quad \text{s.t.} \quad w = \sum_{g \in \mathcal{G}} v^g. \qquad (3)$$

Problem (3) is always feasible since we assume that all covariates belong to at least one
group. Intuitively, the vectors $v = (v^g)_{g \in \mathcal{G}}$ in (3) represent a decomposition of $w$ as a sum
of latent vectors whose supports are included in each group, as illustrated in Figure 1(b).
Applying the $\ell_1/\ell_2$ penalty to these latent vectors favors solutions which shrink some $v^g$
to $0$, while the non-shrunk components satisfy $\mathrm{supp}(v^g) = g$. On the other hand, since we
enforce $w = \sum_{g \in \mathcal{G}} v^g$, a $w_i$ can be nonzero as long as $i$ belongs to at least one non-shrunk
group. More precisely, if we denote by $\mathcal{G}_1 \subset \mathcal{G}$ the set of groups $g$ with $v^g \neq 0$ for the
solution of (3), then we immediately get $w = \sum_{g \in \mathcal{G}_1} v^g$, and therefore we can expect:

$$\mathrm{supp}(w) = \bigcup_{g \in \mathcal{G}_1} g.$$

In other words, this formulation leads to sparse solutions whose support is likely to be a
union of groups.
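
The contrast between the two support patterns can be sketched in a few lines of Python. The helper names and the three overlapping groups below are hypothetical and only mirror the situation of Figure 1; the second helper assumes that no exact cancellation occurs between latent vectors.

```python
def support_group_lasso(groups, zeroed, p):
    """Support under penalty (1) when the groups in `zeroed` are set to zero:
    the complement of the union of the zeroed groups."""
    removed = set().union(*(groups[k] for k in zeroed)) if zeroed else set()
    return set(range(p)) - removed

def support_latent(groups, selected):
    """Support under (3) when only the latent vectors in `selected` are nonzero:
    the union of the selected groups (assuming no exact cancellation)."""
    return set().union(*(groups[k] for k in selected)) if selected else set()

# Three overlapping groups over p = 7 covariates (0-indexed).
groups = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}]

# Shrinking groups 1 and 3 to zero with penalty (1) leaves only covariate 3 ...
direct = support_group_lasso(groups, zeroed=[0, 2], p=7)
# ... whereas shrinking the latent vectors of groups 1 and 3 keeps all of group 2.
latent = support_latent(groups, selected=[1])
print(direct, latent)
```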

Interestingly, problem (3) can be reformulated as the minimization of the cost function
$L(w)$ penalized by a new regularizer which is a function of $w$ only. Indeed, since the
minimization over $v$ only involves the penalty term and the constraints, we can rewrite (3)
as

$$\min_{w \in \mathbb{R}^p} L(w) + \lambda\, \Omega^{\mathcal{G}}_{\cup}(w), \qquad (4)$$

with

$$\Omega^{\mathcal{G}}_{\cup}(w) = \min_{v \in \mathcal{V}_{\mathcal{G}},\ \sum_{g \in \mathcal{G}} v^g = w}\ \sum_{g \in \mathcal{G}} d_g \|v^g\|. \qquad (5)$$

We call this penalty the latent group Lasso penalty, in reference to its formulation as a
group Lasso over latent variables. When the groups do not overlap and form a partition,
there exists a unique decomposition of $w \in \mathbb{R}^p$ as $w = \sum_{g \in \mathcal{G}} v^g$ with $\mathrm{supp}(v^g) \subset g$, namely,
$v^g = w_g$ for all $g \in \mathcal{G}$. In that case, the group Lasso penalty (1) and the latent group
Lasso penalty (5) are equal and boil down to the same standard group Lasso. When some
groups overlap, however, the two penalties differ. For example, Figure 2 shows the unit ball
for both norms in $\mathbb{R}^3$ with groups $\mathcal{G} = \{\{1, 2\}, \{2, 3\}\}$. The pillow-shaped ball of $\|\cdot\|_{\ell_1/\ell_2}$
has four singularities corresponding to cases where either only $w_1$ or only $w_3$ is nonzero.
By contrast, $\Omega^{\mathcal{G}}_{\cup}$ has two circular sets of singularities corresponding to cases where $(w_1, w_2)$
only or $(w_2, w_3)$ only is nonzero. For comparison, we also show the unit ball when we
consider the partition $\mathcal{G} = \{\{1, 2\}, \{3\}\}$, in which case both norms coincide: singularities
appear for $(w_1, w_2) = 0$ or $w_3 = 0$.
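
For the two overlapping groups $\{1,2\}$ and $\{2,3\}$ of this example, the minimization in (5) can be carried out by hand, which gives a simple way to compare the two norms numerically. The sketch below assumes unit weights; the function names are hypothetical and the ternary search is only one convenient way to minimize a one-dimensional convex function, not an algorithm taken from the paper.

```python
import math

def latent_norm_3d(w, iters=200):
    """Latent group Lasso norm (5) for groups {1,2}, {2,3} in R^3, unit weights.

    The only freedom in the decomposition is how w_2 is split:
    v^1 = (w_1, t, 0) and v^2 = (0, w_2 - t, w_3). The cost is convex in t and
    a minimizer lies between 0 and w_2, so a ternary search is sufficient.
    """
    w1, w2, w3 = w

    def cost(t):
        return math.hypot(w1, t) + math.hypot(w2 - t, w3)

    lo, hi = min(0.0, w2), max(0.0, w2)
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if cost(a) <= cost(b):
            hi = b
        else:
            lo = a
    return cost((lo + hi) / 2)

def overlap_group_lasso_3d(w):
    """Penalty (1) applied directly to the same two overlapping groups."""
    w1, w2, w3 = w
    return math.hypot(w1, w2) + math.hypot(w2, w3)

for w in [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]:
    print(w, latent_norm_3d(w), overlap_group_lasso_3d(w))
```

On the vector $(0, 1, 0)$, whose only nonzero covariate is shared by both groups, penalty (1) counts $w_2$ twice, while the latent formulation pays for it only once.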

Figure 2: Unit balls for $\|\cdot\|_{\ell_1/\ell_2}$ (left), proposed by Jenatton et al. (2009), and $\Omega^{\mathcal{G}}_{\cup}$ (middle),
proposed in this paper, for the groups $\mathcal{G} = \{\{1, 2\}, \{2, 3\}\}$. $w_2$ is represented as
the vertical coordinate. We note that singularities exist in both cases, but occur
at different positions: for $\|\cdot\|_{\ell_1/\ell_2}$ they correspond to situations where only $w_1$
or only $w_3$ is nonzero, i.e., where all covariates of one group are shrunk to 0; for
$\Omega^{\mathcal{G}}_{\cup}$ they correspond to situations where only $w_1$ or only $w_3$ is equal to 0, i.e.,
where all covariates of one group are nonzero. For comparison, we show on the
right the unit ball of both norms for the partition $\mathcal{G} = \{\{1, 2\}, \{3\}\}$, where they
both reduce to the classical group Lasso penalty.

To summarize, we enforce a prior we have on $w$ by introducing new variables in the
optimization problem (3). The constraint we impose is that some groups should be shrunk
to zero, and a covariate should have zero weight in $w$ if all the groups to which it belongs
are set to zero. Equivalently, the support of $w$ should be a union of groups. This new
problem can be rewritten as a classical minimization of the empirical risk, penalized by
a particular penalty $\Omega^{\mathcal{G}}_{\cup}$ defined in (5). This penalty itself associates to each vector $w$ the
solution of a particular constrained optimization problem. While this formulation may not
be the most intuitive, it allows us to reframe the problem in the classical context of penalized
empirical risk minimization. In the remainder of this article, we investigate in more detail
the latent group Lasso penalty $\Omega^{\mathcal{G}}_{\cup}$, both theoretically and empirically.

3.1 Related work 

The idea of decomposing a parameter vector into some latent components and of regularizing
each of these components separately has appeared recently in the literature independently
of this work. In particular, Jalali et al. (2010) proposed to consider such a decomposition
in the case of multi-task learning, where each task-specific parameter vector is decomposed
into a first $\ell_1$-regularized vector and another vector, regularized with an $\ell_1/\ell_\infty$ norm, so as
to share its sparsity pattern with all other tasks. The norm considered in that work could
be interpreted as a special case of the latent group Lasso, where the set of groups consists of
all singletons and groups of coefficients associated with the same feature across tasks. The
decomposition into latent variables is even more natural in the context of the work of Chen
et al. (2011), Candes et al. (2009), or Agarwal et al. (2011) on robust PCA and matrix
decomposition, in which a matrix is decomposed into a low-rank matrix regularized by the
trace norm and a sparse or column-sparse matrix regularized by an $\ell_1$ or group $\ell_1$-norm.

Another type of decomposition which is related to this norm is the idea of a cover of the
support. In particular it is interesting to consider the $\ell_0$ counterpart to this norm, which
could be written as

$$\Omega^{\mathcal{G}}_{0}(w) = \min_{\mathcal{G}' \subset \mathcal{G}} \sum_{g \in \mathcal{G}'} d_g \quad \text{s.t.} \quad w = \sum_{g \in \mathcal{G}'} v^g, \quad \mathrm{supp}(v^g) \subset g.$$

$\Omega^{\mathcal{G}}_{0}$ can then be interpreted as the value of a min set-cover. This penalization has been
considered in Huang et al. (2009) under the name block coding, since, indeed, when $d_g$ is
interpreted as a coding length, this penalization induces a code length on all sets, which
can be interpreted in the MDL framework.
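
The set-cover interpretation can be illustrated by brute force on a small example. The sketch below assumes that the $\ell_0$ counterpart is the minimal total weight of a sub-collection of groups whose union contains the support of $w$; the function name, groups and weights are hypothetical, and the enumeration is exponential in the number of groups, so it is meant for illustration only.

```python
from itertools import combinations

def block_coding_penalty(support, groups, weights):
    """Minimal total weight of a sub-collection of groups covering `support`.

    Exhaustive search over all sub-collections; returns infinity if the
    support cannot be covered.
    """
    best = float("inf")
    indices = range(len(groups))
    for r in range(len(groups) + 1):
        for subset in combinations(indices, r):
            covered = set().union(*(groups[k] for k in subset)) if subset else set()
            if support <= covered:
                best = min(best, sum(weights[k] for k in subset))
    return best

# Hypothetical groups and coding lengths d_g (0-indexed covariates).
groups = [{0, 1}, {1, 2}, {2, 3}, {0, 1, 2, 3}]
weights = [1.0, 1.0, 1.0, 2.5]

full = block_coding_penalty({0, 1, 2, 3}, groups, weights)  # two small groups beat the big one
single = block_coding_penalty({1}, groups, weights)
print(full, single)
```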

More generally, one could consider $\Omega^{\mathcal{G}}_{q}$ penalties, for all $q > 0$, by replacing the $\ell_2$
norm used in the definition of the latent group Lasso penalty (5) by an $\ell_q$ norm. It should
be noted then that, unlike the support, the definition of group-support we introduce in
Section 5 changes if one considers the latent group Lasso with a different $\ell_q$-norm, and even
if the weights $d_g$ change.$^1$

Obozinski and F. (2011) consider the case of $\Omega^{\mathcal{G}}_{q}$, when the weights are given by a set
function, and show that $\Omega^{\mathcal{G}}_{q}$ is then the tightest convex $\ell_q$ relaxation of the block-coding
scheme of Huang et al. (2009). They also show that when $\mathcal{G} = 2^V$ and the weights are an
appropriate power of a submodular function then $\Omega^{\mathcal{G}}_{q}$ is the norm that naturally extends
the norm considered by Bach (2010).

1. We discuss the choice of weights in detail in Section 8. 

It should be noted that recent theoretical analyses of the norm studied in this paper
have been proposed by Percival (2011) and Maurer and Pontil (2011). They adopt points
of view or focus on questions that are complementary to this work; we discuss those in
Section 7.3.

4. Some properties of the latent group Lasso penalty 

In this section we study a few properties of the latent group Lasso penalty $\Omega^{\mathcal{G}}_{\cup}$, which will be
particularly useful to prove the consistency results of Section 7. After showing that $\Omega^{\mathcal{G}}_{\cup}$
is a valid norm, we compute its dual norm and provide two variational formulas. We then
characterize its unit ball as the convex hull of basic disks, and compute its subdifferential.
When used as a penalty for statistical inference, we further reinterpret it in the context of
covariate duplication and multiple kernel learning. To lighten notations, in the rest of the
paper we simply denote $\Omega^{\mathcal{G}}_{\cup}$ by $\Omega$.

4.1 Basic properties 

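We first analyze the decomposition induced by (5) of a vector $w \in \mathbb{R}^p$ as $\sum_{g \in \mathcal{G}} v^g$. We
denote by $\mathcal{V}(w) \subset \mathcal{V}_{\mathcal{G}}$ the set of $m$-tuples of vectors $v = (v^g)_{g \in \mathcal{G}} \in \mathcal{V}_{\mathcal{G}}$ that are solutions to
the optimization problem in (5), i.e., which satisfy

$$w = \sum_{g \in \mathcal{G}} v^g \quad \text{and} \quad \Omega(w) = \sum_{g \in \mathcal{G}} d_g \|v^g\|.$$

We first have that:

Lemma 1 For any $w \in \mathbb{R}^p$, $\mathcal{V}(w)$ is non-empty, compact and convex.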

Proof The objective of problem (5) is a proper closed convex function with no direction 
of recession. Lemma 1 is then the consequence of classical results in convex analysis, such 
as Theorem 27.2 page 265 of Rockafellar (1997). ■ 

The following statement shows that, unsurprisingly, we can regard $\Omega$ as a classical
norm-based penalty.

Lemma 2 $w \mapsto \Omega(w)$ is a norm.

Proof Positive homogeneity and positive definiteness hold trivially. We show the triangle
inequality. Consider $w, w' \in \mathbb{R}^p$, and let $v \in \mathcal{V}(w)$ and $v' \in \mathcal{V}(w')$ be respectively optimal
decompositions of $w$ and $w'$, so that $\Omega(w) = \sum_g d_g \|v^g\|$ and $\Omega(w') = \sum_g d_g \|v'^g\|$ with
$w = \sum_g v^g$ and $w' = \sum_g v'^g$. Since $(v^g + v'^g)_{g \in \mathcal{G}}$ is a (a priori non-optimal) decomposition
of $w + w'$, we clearly have:

$$\Omega(w + w') \leq \sum_{g \in \mathcal{G}} d_g \|v^g + v'^g\| \leq \sum_{g \in \mathcal{G}} d_g \left( \|v^g\| + \|v'^g\| \right) = \Omega(w) + \Omega(w'). \qquad \blacksquare$$
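
4.2 Dual norm and variational characterizations

$\Omega$ being a norm, by Lemma 2, we can consider its Fenchel dual norm $\Omega^*$ defined by:

$$\forall \alpha \in \mathbb{R}^p, \quad \Omega^*(\alpha) = \sup \left\{ w^\top \alpha \mid \Omega(w) \leq 1 \right\}. \qquad (6)$$

The following lemma shows that $\Omega^*$ has a simple closed-form expression:

Lemma 3 (dual norm) The Fenchel dual norm $\Omega^*$ of $\Omega$ satisfies:

$$\forall \alpha \in \mathbb{R}^p, \quad \Omega^*(\alpha) = \max_{g \in \mathcal{G}} d_g^{-1} \|\alpha_g\|.$$

Proof We start from the definition of the dual norm (6) and compute:

$$\begin{aligned}
\Omega^*(\alpha) &= \max_{w \in \mathbb{R}^p} w^\top \alpha \quad \text{s.t.} \quad \Omega(w) \leq 1 \\
&= \max_{w \in \mathbb{R}^p,\ v \in \mathcal{V}_{\mathcal{G}}} w^\top \alpha \quad \text{s.t.} \quad w = \sum_{g \in \mathcal{G}} v^g, \ \sum_{g \in \mathcal{G}} d_g \|v^g\| \leq 1 \\
&= \max_{v \in \mathcal{V}_{\mathcal{G}}} \sum_{g \in \mathcal{G}} v^{g\top} \alpha \quad \text{s.t.} \quad \sum_{g \in \mathcal{G}} d_g \|v^g\| \leq 1 \\
&= \max_{v \in \mathcal{V}_{\mathcal{G}},\ \eta \in \mathbb{R}^m_+} \sum_{g \in \mathcal{G}} v^{g\top} \alpha \quad \text{s.t.} \quad \sum_{g \in \mathcal{G}} \eta_g \leq 1, \ \forall g \in \mathcal{G}, \ d_g \|v^g\| \leq \eta_g \\
&= \max_{\eta \in \mathbb{R}^m_+} \sum_{g \in \mathcal{G}} \eta_g\, d_g^{-1} \|\alpha_g\| \quad \text{s.t.} \quad \sum_{g \in \mathcal{G}} \eta_g \leq 1 \\
&= \max_{g \in \mathcal{G}} d_g^{-1} \|\alpha_g\|.
\end{aligned}$$

The second equality is due to the fact that:

$$\{w \mid \Omega(w) \leq 1\} = \Big\{ w \mid \exists v \in \mathcal{V}_{\mathcal{G}} \ \text{s.t.} \ w = \sum_{g \in \mathcal{G}} v^g, \ \sum_{g \in \mathcal{G}} d_g \|v^g\| \leq 1 \Big\},$$

and the fifth results from the explicit solution $v^g = \alpha_g\, \eta_g\, d_g^{-1} \|\alpha_g\|^{-1}$ of the maximization
in $v$ in the fourth line. ■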
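
Since the dual norm of Lemma 3 is a maximum of weighted Euclidean norms of the restrictions of $\alpha$ to the groups, it is straightforward to evaluate even when the groups overlap. A minimal sketch, with a hypothetical function name and toy values:

```python
import math

def dual_norm(alpha, groups, weights):
    """Dual norm of Lemma 3: max over groups g of ||alpha_g||_2 / d_g."""
    return max(math.sqrt(sum(alpha[i] ** 2 for i in g)) / d
               for g, d in zip(groups, weights))

# Two overlapping groups (0-indexed covariates) with weights d = (1, 2).
groups = [(0, 1), (1, 2)]
weights = [1.0, 2.0]
alpha = [3.0, 4.0, 0.0]

# The first group gives 5 / 1 and the second gives 4 / 2.
value = dual_norm(alpha, groups, weights)
print(value)
```

Unlike the primal norm, no optimization over decompositions is needed here, which is what makes the dual characterization convenient in the optimality conditions discussed later in this section.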



Remark 4 Remembering that the infimal convolution $f \star_{\inf} g$ of two convex functions $f$
and $g$ is defined as $(f \star_{\inf} g)(w) = \inf_{v \in \mathbb{R}^p} \{f(v) + g(w - v)\}$ (see Rockafellar, 1997), it
could be noted that $\Omega$ is the infimal convolution of all functions $\omega_g$ for $g \in \mathcal{G}$ defined as
$\omega_g : w \mapsto d_g \|w_g\| + \iota_g(w)$ with $\iota_g(w) = 0$ if $\mathrm{supp}(w) \subset g$ and $+\infty$ otherwise. One of the main
properties motivating the notion of infimal convolution is the fact that it can be defined via
$(f \star_{\inf} g)^* = f^* + g^*$, where $*$ denotes Fenchel-Legendre conjugation. Several of the properties
of $\Omega$ can be derived from this interpretation but we will however show them directly.

The norm $\Omega$ was initially defined as the solution of an optimization problem in (5).
From the characterization of $\Omega^*$ we can easily derive a second variational formulation:

Lemma 5 (second variational formulation) For any $w \in \mathbb{R}^p$, we have

$$\Omega(w) = \max_{\alpha \in \mathbb{R}^p} \alpha^\top w \quad \text{s.t.} \quad \|\alpha_g\| \leq d_g \ \text{for all } g \in \mathcal{G}. \qquad (7)$$

Proof Since the bi-dual of a norm is the norm itself, we have the variational form

$$\Omega(w) = \max_{\alpha \in \mathbb{R}^p} \alpha^\top w \quad \text{s.t.} \quad \Omega^*(\alpha) \leq 1. \qquad (8)$$

Plugging the characterization of $\Omega^*$ of Lemma 3 into this equation finishes the proof. ■

For any $w \in \mathbb{R}^p$, we denote by $\mathcal{A}(w)$ the set of $\alpha \in \mathbb{R}^p$ in the dual unit ball which solve
the second variational formulation (7) of $\Omega$, namely:

$$\mathcal{A}(w) = \operatorname*{argmax}_{\alpha \in \mathbb{R}^p,\ \Omega^*(\alpha) \leq 1} \alpha^\top w. \qquad (9)$$

With a little more effort, we can also derive a third variational representation of the norm
$\Omega$, which will be useful in Section 7 in the proofs of consistency:

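Lemma 6 (third variational formulation) For any $w \in \mathbb{R}^p$, we also have

$$\Omega(w) = \frac{1}{2} \min_{\lambda \in \mathbb{R}^m_+} \ \sum_{i=1}^p \frac{w_i^2}{\sum_{g \ni i} \lambda_g} + \sum_{g \in \mathcal{G}} d_g^2 \lambda_g. \qquad (10)$$

Proof For any $w \in \mathbb{R}^p$, we can rewrite the solution of the constrained optimization problem
of the second variational formulation (7) as the saddle point of the Lagrangian:

$$\Omega(w) = \min_{\lambda \in \mathbb{R}^m_+} \max_{\alpha \in \mathbb{R}^p} \ w^\top \alpha - \frac{1}{2} \sum_{g \in \mathcal{G}} \lambda_g \left( \|\alpha_g\|^2 - d_g^2 \right).$$

Optimizing in $\alpha$ leads to $\alpha$ being solution of $w_i = \alpha_i \sum_{g \ni i} \lambda_g$, which (distinguishing the
cases $w_i = 0$ and $w_i \neq 0$) yields problem (10) when replacing $\alpha_i$ by its optimal value. ■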

Let us denote by $\Lambda(w) \subset \mathbb{R}^m_+$ the set of solutions to the third variational formulation
(10). Note that there is not necessarily a unique solution to (10), because the Hessian
of the objective function is not always positive definite (see Lemma 48 in Appendix D
for a characterization of cases in which positive definiteness can be guaranteed). For any
$w \in \mathbb{R}^p$, we now have three variational formulations for $\Omega(w)$, namely (5), (7) and (10),
with respective solution sets $\mathcal{V}(w)$, $\mathcal{A}(w)$ and $\Lambda(w)$. The following lemma shows that
$\mathcal{V}(w)$ is in bijection with $\Lambda(w)$.

Lemma 7 Let $w \in \mathbb{R}^p$. The mapping

$$\lambda : \mathcal{V}_{\mathcal{G}} \to \mathbb{R}^m, \qquad v \mapsto \lambda(v) = \left( d_g^{-1} \|v^g\| \right)_{g \in \mathcal{G}} \qquad (11)$$

is a bijection from $\mathcal{V}(w)$ to $\Lambda(w)$. For any $\lambda \in \Lambda(w)$, the only vector $v \in \mathcal{V}(w)$ that
satisfies $\lambda(v) = \lambda$ is given by $v^g = \lambda_g \alpha_g$, where $\alpha$ is any vector of $\mathcal{A}(w)$.
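
Proof To express the penalty as a minimization problem, let us use the following basic
equality valid for any $x \in \mathbb{R}_+$:

$$x = \frac{1}{2} \min_{\eta \geq 0} \left[ \frac{x^2}{\eta} + \eta \right],$$

where the unique minimum in $\eta$ is reached for $\eta = x$. From this we deduce that, for any
$v \in \mathbb{R}^p$ and $d > 0$:

$$d \|v\| = \frac{1}{2} \min_{\eta \geq 0} \left[ \frac{d^2 \|v\|^2}{\eta} + \eta \right] = \frac{1}{2} \min_{\lambda' \geq 0} \left[ \frac{\|v\|^2}{\lambda'} + d^2 \lambda' \right],$$

where the unique minimum in the last term is attained for $\lambda' = d^{-1} \|v\|$. Using definition
(5) we can therefore write $\Omega(w)$ as the optimum value of a jointly convex optimization
problem in $v \in \mathcal{V}_{\mathcal{G}}$ and $\lambda' = (\lambda'_g)_{g \in \mathcal{G}} \in \mathbb{R}^m_+$:

$$\Omega(w) = \min_{v \in \mathcal{V}_{\mathcal{G}},\ \sum_{g} v^g = w,\ \lambda' \in \mathbb{R}^m_+} \ \frac{1}{2} \sum_{g \in \mathcal{G}} \left[ \frac{\|v^g\|^2}{\lambda'_g} + d_g^2 \lambda'_g \right], \qquad (12)$$

where for any $v$, the minimum in $\lambda'$ is uniquely attained for $\lambda' = \lambda(v)$ defined in (11).
By definition of $\mathcal{V}(w)$, the set of solutions of (12) is therefore exactly the set of pairs of
the form $(v, \lambda(v))$ for $v \in \mathcal{V}(w)$. Let us now isolate the minimization over $v$ in (12). To
incorporate the constraint $\sum_{g \in \mathcal{G}} v^g = w$ we rewrite (12) with a Lagrangian:

$$\Omega(w) = \min_{\lambda' \in \mathbb{R}^m_+} \max_{\alpha' \in \mathbb{R}^p} \min_{v \in \mathcal{V}_{\mathcal{G}}} \ \frac{1}{2} \sum_{g \in \mathcal{G}} \left[ \frac{\|v^g\|^2}{\lambda'_g} + d_g^2 \lambda'_g \right] + \alpha'^\top \Big( w - \sum_{g \in \mathcal{G}} v^g \Big).$$

The inner minimization in $v$, for fixed $\lambda'$ and $\alpha'$, yields $v^g = \lambda'_g \alpha'_g$. The constraint
$w = \sum_{g \in \mathcal{G}} v^g$ therefore implies that, after optimization in $v$ and $\alpha'$, we have
$\alpha'_i = w_i / \sum_{h \ni i} \lambda'_h$, and as a consequence that $v^g_i = \lambda'_g w_i / \sum_{h \ni i} \lambda'_h$. A small
computation now shows that, after optimization in $v$ and $\alpha'$ for a fixed $\lambda'$, we have:

$$\sum_{g \in \mathcal{G}} \frac{\|v^g\|^2}{\lambda'_g} = \sum_{g \in \mathcal{G}} \sum_{i \in g} \frac{\lambda'_g w_i^2}{\big( \sum_{h \ni i} \lambda'_h \big)^2} = \sum_{i=1}^p w_i^2 \frac{\sum_{g \ni i} \lambda'_g}{\big( \sum_{h \ni i} \lambda'_h \big)^2} = \sum_{i=1}^p \frac{w_i^2}{\sum_{h \ni i} \lambda'_h}.$$

Plugging this into (12), we see that after optimization in $v$, the optimization problem in $\lambda'$
is exactly (10), which by definition admits $\Lambda(w)$ as solutions, while we showed that (12)
admits $\lambda(\mathcal{V}(w))$ as solutions. This shows that $\lambda(\mathcal{V}(w)) = \Lambda(w)$, and since for any $\lambda' \in \Lambda(w)$
there exists a unique $v \in \mathcal{V}(w)$ that satisfies $\lambda(v) = \lambda'$, namely, $v^g_i = \lambda'_g w_i / \sum_{h \ni i} \lambda'_h$, $\lambda$ is
indeed a bijection from $\mathcal{V}(w)$ to $\Lambda(w)$. Finally, we noted in the proof of Lemma 6 that for
any $\lambda \in \Lambda(w)$ and $\alpha \in \mathcal{A}(w)$, $w_i = \alpha_i \sum_{h \ni i} \lambda_h$. This shows that the unique $v \in \mathcal{V}(w)$
associated to a $\lambda \in \Lambda(w)$ can equivalently be written $v^g = \lambda_g \alpha_g$, which concludes the
proof of Lemma 7. ■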
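
The correspondence between optimal decompositions and solutions of the third variational formulation can be checked numerically on a small case. The sketch below uses the groups $\{1,2\}, \{2,3\}$ with unit weights and $w = (1,1,1)$, for which splitting $w_2$ evenly between the two latent vectors is optimal (the derivative of the cost with respect to the split vanishes there). All function names are hypothetical.

```python
import math

def objective_10(w, lam, groups, weights):
    """Objective of (10): 0.5 * (sum_i w_i^2 / S_i + sum_g d_g^2 lam_g),
    where S_i is the sum of lam_g over the groups g containing i."""
    total = sum(d * d * l for d, l in zip(weights, lam))
    for i, wi in enumerate(w):
        if wi == 0.0:
            continue                      # convention 0/0 = 0
        denom = sum(l for g, l in zip(groups, lam) if i in g)
        if denom == 0.0:
            return math.inf               # w_i != 0 but no active group covers i
        total += wi * wi / denom
    return 0.5 * total

def lam_of_v(v, weights):
    """Mapping (11): lam_g = ||v^g|| / d_g."""
    return [math.sqrt(sum(x * x for x in vg)) / d for vg, d in zip(v, weights)]

def v_of_lam(w, lam, groups):
    """Inverse on the solution sets: v^g_i = lam_g * w_i / S_i."""
    v = []
    for g, lg in zip(groups, lam):
        vg = [0.0] * len(w)
        for i in g:
            denom = sum(l for h, l in zip(groups, lam) if i in h)
            if denom > 0.0:
                vg[i] = lg * w[i] / denom
        v.append(vg)
    return v

groups, weights = [(0, 1), (1, 2)], [1.0, 1.0]   # 0-indexed covariates
w = [1.0, 1.0, 1.0]
v_opt = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]       # even split of w_2
lam = lam_of_v(v_opt, weights)

value = objective_10(w, lam, groups, weights)    # should match the norm, sqrt(5)
v_back = v_of_lam(w, lam, groups)                # should recover v_opt
print(value, v_back)
```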

4.3 Characterization of the unit ball of $\Omega$ as a convex hull

Figure 2(b) suggests visually that the unit ball of $\Omega$ is just the convex hull of a horizontal
disk and a vertical one. This impression is correct and formalized more generally in the
following lemma.

Lemma 8 For any group $g \in \mathcal{G}$, define the hyperdisk $\mathcal{D}_g = \{w \in \mathbb{R}^p \mid \|w_g\| \leq d_g^{-1},\ w_{g^c} = 0\}$.
Then, the unit ball of $\Omega$ is the convex hull of the union of hyperdisks $\bigcup_{g \in \mathcal{G}} \mathcal{D}_g$.

Proof Let $w \in \mathrm{ConvHull}\big( \bigcup_{g \in \mathcal{G}} \mathcal{D}_g \big)$; then there exist $\alpha^g \in \mathcal{D}_g$ and $t_g \in \mathbb{R}_+$, for all
$g \in \mathcal{G}$, such that $\sum_{g \in \mathcal{G}} t_g = 1$ and $w = \sum_{g \in \mathcal{G}} t_g \alpha^g$. Taking $v = (t_g \alpha^g)_{g \in \mathcal{G}}$ as a suboptimal
decomposition of $w$, we easily get

$$\Omega(w) \leq \sum_{g \in \mathcal{G}} d_g \|t_g \alpha^g\| \leq \sum_{g \in \mathcal{G}} t_g \leq 1.$$

Conversely, if $\Omega(w) \leq 1$, then there exists $v \in \mathcal{V}_{\mathcal{G}}$ such that $\sum_{g \in \mathcal{G}} d_g \|v^g\| \leq 1$ and we
obtain $\alpha^g \in \mathcal{D}_g$ and $t$ in the simplex (up to completing the combination with the point $0$,
which belongs to every $\mathcal{D}_g$) by letting $t_g = d_g \|v^g\|$ and

$$\alpha^g = \begin{cases} v^g / t_g & \text{if } t_g \neq 0, \\ 0 & \text{if } t_g = 0. \end{cases} \qquad \blacksquare$$

It should be noted that this lemma shows that $\Omega$ is the gauge of the convex hull of
the disks $\mathcal{D}_g$; in other words, $\Omega$ is, in the terminology introduced by Chandrasekaran et al.
(2010), the atomic norm associated with the union of disks $\mathcal{D}_g$.

4.4 Subdifferential of $\Omega$

The subdifferential of $\Omega$ at $w$ is, by definition:

$$\partial \Omega(w) = \left\{ s \in \mathbb{R}^p \mid \forall h \in \mathbb{R}^p, \ \Omega(w + h) - \Omega(w) \geq s^\top h \right\}.$$

It is a standard result of convex optimization (resulting e.g. from characterization (b*) of
the subdifferential in Theorem 23.5, p. 218, Rockafellar, 1997) that for all $w \in \mathbb{R}^p$, $\partial \Omega(w) = \mathcal{A}(w)$,
where $\mathcal{A}(w)$ was defined in (9).

We can now show a simple relationship between the decomposition $(v^g)_{g \in \mathcal{G}}$ of a vector
$w$ induced by $\Omega$, and the subdifferential of $\Omega$.

Lemma 9 For any $\alpha \in \mathcal{A}(w) = \partial \Omega(w)$ and any $v \in \mathcal{V}(w)$,

either $v^g \neq 0$ and $\alpha_g = d_g \frac{v^g}{\|v^g\|}$,
or $v^g = 0$ and $\|\alpha_g\| \leq d_g$.

Proof Let $v \in \mathcal{V}(w)$ and $\alpha \in \mathcal{A}(w)$. Since $\Omega^*(\alpha) \leq 1$, we have $\|\alpha_g\| \leq d_g$, which
implies $\alpha^\top v^g \leq d_g \|v^g\|$. On the other hand, we also have $\alpha^\top w = \Omega(w)$, so that
$0 = \Omega(w) - \alpha^\top w = \sum_g \left( d_g \|v^g\| - \alpha_g^\top v^g \right)$, which is a sum of non-negative terms. We conclude
that, for all $g \in \mathcal{G}$, we have $\alpha_g^\top v^g = d_g \|v^g\|$, which yields the result. ■

We can deduce a general property of all decompositions of a given vector:

Corollary 10 Let $w \in \mathbb{R}^p$. For all $\bar{v}, \bar{v}' \in \mathcal{V}(w)$, and for all $g \in \mathcal{G}$, we have $v^g = 0$ or $v'^g = 0$ or there exists $\gamma \in \mathbb{R}$ such that $v^g = \gamma v'^g$.

Proof By Lemma 9, if $v^g \neq 0$ and $v'^g \neq 0$, then $\alpha_g = d_g \frac{v^g}{\|v^g\|} = d_g \frac{v'^g}{\|v'^g\|}$, so that $v^g = \frac{\|v^g\|}{\|v'^g\|} v'^g$. ■

4.5 $\Omega$ as a regularizer

We consider in this section the situation where $\Omega$ is used as a regularizer for an empirical risk minimization problem. Specifically, let us consider a convex differentiable loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, such as the squared error $\ell(t, y) = (t - y)^2$ for regression problems or the logistic loss $\ell(t, y) = \log(1 + e^{-yt})$ for classification problems where $y = \pm 1$. Given a set of $n$ training pairs $(x^{(i)}, y^{(i)}) \in \mathbb{R}^p \times \mathbb{R}$, $i = 1, \ldots, n$, we define the empirical risk $L(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top x^{(i)}, y^{(i)})$ and consider the regularized empirical risk minimization problem

$$\min_{w \in \mathbb{R}^p} L(w) + \lambda \Omega(w). \qquad (13)$$

Its solutions are characterized by optimality conditions from subgradient calculus: 

Lemma 11 A vector $w \in \mathbb{R}^p$ is a solution of (13) if and only if one of the following equivalent conditions is satisfied:

(a) $-\nabla L(w)/\lambda \in \mathcal{A}(w)$;

(b) $w$ can be decomposed as $w = \sum_{g \in \mathcal{G}} v^g$ for some $\bar{v} \in \mathcal{V}_{\mathcal{G}}$ with, for all $g \in \mathcal{G}$:

either $v^g \neq 0$ and $\nabla_g L(w) = -\lambda d_g \, v^g / \|v^g\|$, or $v^g = 0$ and $d_g^{-1} \|\nabla_g L(w)\| \le \lambda$.

Proof (a) is immediate from subgradient calculus and the fact that $\partial\Omega(w) = \mathcal{A}(w)$ (see Section 4.4). (b) is immediate from Lemma 9. ■
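
Condition (b) of Lemma 11 lends itself to a direct numerical check. The following Python sketch is only an illustration, not the authors' code: the function name `satisfies_lemma_11b`, the 0-based group indexing and the tolerance are assumptions. It is exercised on a hypothetical single-group problem with loss $\frac{1}{2}\|w - y\|^2$, whose minimizer is the group soft-thresholding of $y$.

```python
import math

def satisfies_lemma_11b(grad, v, groups, weights, lam, tol=1e-9):
    """Check condition (b) of Lemma 11 for a candidate decomposition.

    grad    : gradient of the loss L at w = sum_g v^g (list of length p)
    v       : list of latent vectors v^g (each of length p, zero outside g)
    groups  : list of 0-based index tuples, one per group (hypothetical layout)
    weights : list of weights d_g
    lam     : regularization parameter lambda
    """
    for v_g, g, d in zip(v, groups, weights):
        grad_g = [grad[i] for i in g]
        v_gg = [v_g[i] for i in g]
        norm_v = math.sqrt(sum(t * t for t in v_gg))
        if norm_v > 0:
            # Active group: grad_g L(w) must equal -lam * d_g * v^g / ||v^g||.
            target = [-lam * d * t / norm_v for t in v_gg]
            if any(abs(a - b) > tol for a, b in zip(grad_g, target)):
                return False
        else:
            # Inactive group: ||grad_g L(w)|| / d_g must not exceed lam.
            if math.sqrt(sum(t * t for t in grad_g)) > lam * d + tol:
                return False
    return True

# Toy check: one group, L(w) = 0.5 * ||w - y||^2, so grad L(w) = w - y.
y = [3.0, 4.0]
groups, weights, lam = [(0, 1)], [1.0], 2.0
shrink = max(0.0, 1.0 - lam / math.hypot(*y))
w_opt = [shrink * t for t in y]                    # group soft-thresholding of y
grad_opt = [wi - yi for wi, yi in zip(w_opt, y)]
is_opt = satisfies_lemma_11b(grad_opt, [w_opt], groups, weights, lam)
is_opt_unshrunk = satisfies_lemma_11b([0.0, 0.0], [y], groups, weights, lam)
```

The shrunk vector passes the check while the unshrunk vector $w = y$ (zero gradient, active group) fails it, as the lemma predicts.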



4.6 Covariate duplication 

In this section we show that empirical risk minimization penalized by $\Omega$ is equivalent to a regular group Lasso in a covariate space of higher dimension obtained by duplication of the covariates belonging to several groups. This has implications for the practical implementation of $\Omega$ as a regularizer and for its generalization to non-linear classification.

More precisely, let us consider the duplication operator:

$$\mathbb{R}^p \to \mathbb{R}^{\sum_{g \in \mathcal{G}} |g|}, \qquad x \mapsto \tilde{x} = \bigoplus_{g \in \mathcal{G}} (x_i)_{i \in g}. \qquad (14)$$

In other words, $\tilde{x}$ is obtained by stacking the restrictions of $x$ to each group on top of each other, resulting in a $\big(\sum_{g \in \mathcal{G}} |g|\big)$-dimensional vector. Note that any coordinate of $x$ that occurs in several groups will be duplicated as many times in $\tilde{x}$. Similarly, for a vector $\bar{v} \in \mathcal{V}_{\mathcal{G}}$, let us denote by $\tilde{v}$ the $\big(\sum_{g \in \mathcal{G}} |g|\big)$-dimensional vector obtained by stacking the restrictions of the successive $v^g$ on their corresponding groups $g$ on top of each other (resulting in no loss of information, since $v^g$ is null outside of $g$). This operation is illustrated in (18) below. Then for any $w \in \mathbb{R}^p$ and $\bar{v} \in \mathcal{V}_{\mathcal{G}}$ such that $w = \sum_{g \in \mathcal{G}} v^g$, we easily get, for any $x \in \mathbb{R}^p$:

$$w^\top x = \sum_{g \in \mathcal{G}} v^{g\top} x = \tilde{v}^\top \tilde{x}. \qquad (15)$$



Consider now a learning problem with training points $x^1, \ldots, x^n \in \mathbb{R}^p$ where we minimize over $w \in \mathbb{R}^p$ a penalized risk function that depends on $w$ only through inner products with the training points, i.e., of the form

$$\min_{w \in \mathbb{R}^p} L(Xw) + \lambda \Omega(w), \qquad (16)$$

where $X$ is the $n \times p$ matrix of training points and $Xw$ is therefore the vector of inner products of $w$ with the training points. Many problems, in particular those considered in Section 4.5, have this form. By definition of $\Omega$ we can rewrite (16) as

$$\min_{w \in \mathbb{R}^p,\ \bar{v} \in \mathcal{V}_{\mathcal{G}}} L(Xw) + \lambda \sum_{g \in \mathcal{G}} d_g \|v^g\| \quad \text{s.t.} \quad w = \sum_{g \in \mathcal{G}} v^g,$$

which by (15) is equivalent to

$$\min_{\tilde{v} \in \mathbb{R}^{\sum_{g \in \mathcal{G}} |g|}} L(\tilde{X}\tilde{v}) + \lambda \sum_{g \in \mathcal{G}} d_g \|\tilde{v}_g\|, \qquad (17)$$

where $\tilde{X}$ is the $n \times \big(\sum_{g \in \mathcal{G}} |g|\big)$ matrix of duplicated training points, and $\tilde{v}_g$ refers to the restriction of $\tilde{v}$ to the coordinates of group $g$. In other words, we have eliminated $w$ from the optimization problem and reformulated it as a simple group Lasso problem without overlap between groups in an expanded space of size $\sum_{g \in \mathcal{G}} |g|$.
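
The duplication operator (14) and the identity (15) can be illustrated with a short Python sketch. The helper names (`duplicate`, `stack_decomposition`) and the toy data are hypothetical, and groups are given as tuples of 0-based indices; this is an illustration under those assumptions, not the authors' implementation.

```python
def duplicate(x, groups):
    """Duplication operator (14): stack the restrictions of x to each group."""
    return [x[i] for g in groups for i in g]

def stack_decomposition(v, groups):
    """Stack each latent vector v^g (length p, zero outside g) restricted to g."""
    return [v_g[i] for v_g, g in zip(v, groups) for i in g]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical toy example: p = 3, two overlapping groups sharing coordinate 1.
groups = [(0, 1), (1, 2)]
x = [2.0, -1.0, 3.0]
v = [[1.0, 0.5, 0.0],      # v^{g_1}, supported on coordinates 0 and 1
     [0.0, 1.5, -2.0]]     # v^{g_2}, supported on coordinates 1 and 2
w = [sum(col) for col in zip(*v)]            # w = sum_g v^g

x_tilde = duplicate(x, groups)               # coordinate 1 appears twice
v_tilde = stack_decomposition(v, groups)
lhs = dot(w, x)                              # w^T x
rhs = dot(v_tilde, x_tilde)                  # v~^T x~, equal to lhs by (15)
```

Both inner products coincide, and the duplicated vectors have dimension $\sum_g |g| = 4$ rather than $p = 3$.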

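On the example of Figure 1, with 3 overlapping groups, this duplication trick can be rewritten as follows:

$$Xw = X_{g_1}\tilde{v}_{g_1} + X_{g_2}\tilde{v}_{g_2} + X_{g_3}\tilde{v}_{g_3} = \big(X_{g_1},\, X_{g_2},\, X_{g_3}\big) \begin{pmatrix} \tilde{v}_{g_1} \\ \tilde{v}_{g_2} \\ \tilde{v}_{g_3} \end{pmatrix} = \tilde{X}\tilde{v}. \qquad (18)$$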



This formulation as a classical group Lasso problem in an expanded space has several implications, detailed in the next two sections. On the one hand, it allows the penalty to be extended to non-linear functions by considering infinite-dimensional duplicated spaces endowed with positive definite kernels (Section 4.7). On the other hand, it leads to straightforward implementations by borrowing classical group Lasso implementations after feature duplication (Section 4.8). Note, however, that the theoretical results we will show in Section 7, on the consistency of the proposed estimator, are not mere consequences of existing results for the classical group Lasso, because, in the case we consider, not only is the design matrix $\tilde{X}$ rank deficient, but so are all of its restrictions to sets of variables corresponding to any union of overlapping groups.

4.7 Multiple Kernel Learning formulations 

Given the reformulation in a duplicated variable space presented above, we provide in this 
section a multiple kernel learning (MKL) interpretation to the regularization by our norm 
and show that it extends naturally the case with disjoint groups. 

To introduce it, we return first to the concept of MKL (Bach et al., 2004; Lanckriet et al., 2004), which we can present as follows. If one considers a learning problem of the form

$$H = \min_{w \in \mathbb{R}^p} L(Xw) + \frac{\lambda}{2} \|w\|^2, \qquad (19)$$

then by the representer theorem the optimal value of the objective $H$ only depends on the input data $X$ through the Gram matrix $K = XX^\top$, which therefore can be replaced by any positive definite (p.d.) kernel between the datapoints. Moreover $H$ can be shown to be a convex function of $K$ (Lanckriet et al., 2004). Given a collection of p.d. kernels $K_1, \ldots, K_k$, any convex combination $K = \sum_{i=1}^k \eta_i K_i$ with $\eta_i \ge 0$ and $\sum_i \eta_i = 1$ is itself a p.d. kernel. The multiple kernel learning problem consists in finding the best such combination in the sense of minimizing $H$:

$$\min_{\eta \in \mathbb{R}_+^k} H\Big(\sum_{i=1}^k \eta_i K_i\Big) \quad \text{s.t.} \quad \sum_{i=1}^k \eta_i = 1. \qquad (20)$$

The kernels considered in the linear combination above are typically reproducing kernels associated with different reproducing kernel Hilbert spaces (RKHS).

Bach et al. (2004) showed that problems regularized by a squared $\ell_1/\ell_2$-norm and multiple kernel learning were intrinsically related. More precisely, they show that, if $\mathcal{G}$ forms a partition of $\{1, \ldots, p\}$, letting problems $(P)$ and $(P')$ be defined through

$$(P) \quad \min_{w \in \mathbb{R}^p} L(Xw) + \frac{\lambda}{2} \Big(\sum_{g \in \mathcal{G}} d_g \|w_g\|\Big)^2 \qquad \text{and} \qquad (P') \quad \min_{\eta \in \mathbb{R}_+^m} H\Big(\sum_{g \in \mathcal{G}} \eta_g K_g\Big) \quad \text{s.t.} \quad \sum_{g \in \mathcal{G}} d_g^2 \eta_g = 1,$$

with $K_g = X_g X_g^\top$, then $(P)$ and $(P')$ are equivalent in the sense that the optimal values of both objectives are equal with a bijection between the optimal solutions. Note that such an equivalence does not hold if the groups $g \in \mathcal{G}$ overlap.

Now turning to the norm we introduced, using the same derivation as the one leading from problem (16) to problem (17), we can show that minimizing $L(Xw) + \frac{\lambda}{2}\Omega(w)^2$ w.r.t. $w$ is equivalent to minimizing $L(\tilde{X}\tilde{v}) + \frac{\lambda}{2}\big(\sum_{g \in \mathcal{G}} d_g \|\tilde{v}_g\|\big)^2$ and setting $w = \sum_{g \in \mathcal{G}} v^g$. At this point, the result of Bach et al. (2004) applied to the latter formulation in the space of duplicates shows that it is equivalent to the multiple kernel learning problem

$$\min_{\eta \in \mathbb{R}_+^m} H\Big(\sum_{g \in \mathcal{G}} \eta_g K_g\Big) \quad \text{s.t.} \quad \sum_{g \in \mathcal{G}} d_g^2 \eta_g = 1, \quad \text{with } K_g = X_g X_g^\top. \qquad (21)$$

This shows that minimizing $L(Xw) + \frac{\lambda}{2}\Omega(w)^2$ is equivalent to the MKL problem above. Compared with the original result of Bach et al. (2004), it should be noted now that, because of the duplication mechanism implicit in our norm, the original sets $g \in \mathcal{G}$ are no longer required to be disjoint. In fact this derivation shows that, in some sense, the norm we introduced is the one that corresponds to the most natural extension of multiple kernel learning to the case of overlapping groups.

Conversely, it should be noted that, while one of the applications of multiple kernel learning is data fusion, where it combines kernels corresponding to functions of intrinsically different input variables, MKL can also be used to select and combine elements from different function spaces defined on the same input. In general these function spaces are not orthogonal and are typically not even disjoint. In that case the MKL formulation corresponds implicitly to using the norm presented in this paper.

Finally, another MKL formulation corresponding to the norm is possible. If we denote by $K_i = X_i X_i^\top$ the rank-one kernel corresponding to the $i$-th feature, then we can write $K_g = \sum_{i \in g} K_i$. If $B \in \mathbb{R}^{p \times m}$ is the binary matrix defined by $B_{ig} = 1_{\{i \in g\}}$, and $\mathcal{Z} = \{B\eta \mid \eta \in \mathbb{R}_+^m,\ \sum_{g \in \mathcal{G}} \eta_g = 1\}$ is the image of the canonical simplex of $\mathbb{R}^m$ by the linear transformation associated with $B$, then with $\zeta \in \mathcal{Z}$ obtained through $\zeta_i = \sum_{g \ni i} \eta_g$, the MKL problem above (taken with unit weights $d_g = 1$) can be reformulated as

$$\min_{\zeta \in \mathcal{Z}} H\Big(\sum_{i=1}^p \zeta_i K_i\Big). \qquad (22)$$

This last formulation can be viewed as the structured MKL formulation associated with the norm $\Omega$ (see Bach et al., 2011, sec. 1.5.4). It is clearly more interesting computationally when $m \gg p$. It is however restricted to a particular form of kernel $K_g$ for each group, which has to be a sum of feature kernels $K_i$. In particular, it does not allow for interactions among features in the group.
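
The link between the group-kernel combination of (21) and the feature-kernel combination of (22) can be checked numerically. The sketch below assumes unit weights, a hypothetical toy design matrix and 0-based group indices; it only verifies the identity $\sum_g \eta_g K_g = \sum_i \zeta_i K_i$ with $\zeta = B\eta$ and does not solve the MKL problem itself.

```python
def outer(col):
    """Rank-one kernel K_i = X_i X_i^T built from one column of X."""
    return [[a * b for b in col] for a in col]

def add_scaled(K, M, c):
    """In-place update K += c * M for square matrices stored as nested lists."""
    for r in range(len(K)):
        for s in range(len(K)):
            K[r][s] += c * M[r][s]

# Hypothetical toy data: n = 2 samples, p = 3 features, m = 2 overlapping groups.
X = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]
groups = [(0, 1), (1, 2)]
eta = [0.25, 0.75]                      # a point of the simplex (unit weights assumed)
n, p = len(X), len(X[0])
columns = [[X[r][i] for r in range(n)] for i in range(p)]
K_feat = [outer(c) for c in columns]

# Formulation (21): combine the group kernels K_g = sum_{i in g} K_i.
K_groups = [[0.0] * n for _ in range(n)]
for g, e in zip(groups, eta):
    for i in g:
        add_scaled(K_groups, K_feat[i], e)

# Formulation (22): zeta = B eta, i.e. zeta_i = sum of eta_g over groups containing i.
zeta = [sum(e for g, e in zip(groups, eta) if i in g) for i in range(p)]
K_struct = [[0.0] * n for _ in range(n)]
for i in range(p):
    add_scaled(K_struct, K_feat[i], zeta[i])
```

The shared feature (index 1) receives the sum of the weights of both groups, which is how the overlap structure is encoded in $\mathcal{Z}$.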

In the two formulations above, it is obviously possible to replace the linear kernel used 
for the derivation by a non-linear kernel. In the case of (21) the combinatorial structure 
of the problem is a priori lost in the sense that the different kernels are no longer linear 
combinations of a set of "primary" kernels, while this is still the case for (22). 

Using non-linear kernels like RBF, or kernels on discrete structures such as sequence- 
or graph-kernels may prove useful in cases where the relationship between the covariates in 
the groups and the output is expected to be non-linear. For example if g is a group of genes 
and the coexpression patterns of genes within the group are associated with the output, 
the group will be deemed important by a non-linear kernel while a linear one may miss it.
More generally, it allows for structured non-linear feature selection. 

4.8 Algorithms 

There are several possible algorithmic approaches to solve the optimization problem (13), 
depending on the structure of the groups in $\mathcal{G}$. The approach we chose in this paper is based on the reformulation by covariate duplication of Section 4.6, and applies an algorithm for the group Lasso in the space of duplicates. To be specific, for the experiments presented in Section 10, we implemented the block-coordinate descent algorithm of Meier et al. (2008) combined with the working set strategy proposed by Roth and Fischer (2008). Note that the covariate duplication of the input matrix $X$ need not be done explicitly in computer memory, since only fast access to the corresponding entries in $X$ is required. Only the vector $\tilde{v}$ which is optimized has to be stored in the duplicated space $\mathbb{R}^{\sum_{g \in \mathcal{G}} |g|}$ and is potentially of large dimension (although sparse) if $\mathcal{G}$ has many groups.
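
As an illustration of working in the space of duplicates without materializing $\tilde{X}$, here is a minimal Python sketch for the squared loss $\frac{1}{2n}\|y - \tilde{X}\tilde{v}\|^2$. It deliberately swaps in a plain proximal-gradient iteration with group soft-thresholding, which is simpler than (and not the same as) the block-coordinate descent with working sets used in the paper. The function name, the crude step size and the iteration count are assumptions, and the design matrix is assumed to be non-zero.

```python
import math

def latent_group_lasso(X, y, groups, weights, lam, n_iter=500):
    """Proximal-gradient sketch for (17) with squared loss (1/2n)||y - X~ v~||^2."""
    n, p = len(X), len(X[0])
    # index[k] is the original column of the k-th duplicated coordinate,
    # so the duplicated matrix X~ is never stored explicitly.
    index = [i for g in groups for i in g]
    starts = [0]
    for g in groups:
        starts.append(starts[-1] + len(g))
    # Crude step size 1/L, using L <= ||X~||_F^2 / n (assumes X is non-zero).
    lip = sum(X[r][i] ** 2 for r in range(n) for i in index) / n
    step = 1.0 / lip
    v = [0.0] * len(index)
    for _ in range(n_iter):
        resid = [sum(X[r][i] * vk for i, vk in zip(index, v)) - y[r] for r in range(n)]
        grad = [sum(X[r][i] * resid[r] for r in range(n)) / n for i in index]
        z = [vk - step * gk for vk, gk in zip(v, grad)]
        for k, d in enumerate(weights):
            a, b = starts[k], starts[k + 1]
            norm = math.sqrt(sum(t * t for t in z[a:b]))
            # Group soft-thresholding: proximal operator of step * lam * d_g * ||.||
            scale = max(0.0, 1.0 - step * lam * d / norm) if norm > 0 else 0.0
            v[a:b] = [scale * t for t in z[a:b]]
    w = [0.0] * p
    for i, vk in zip(index, v):
        w[i] += vk                      # w = sum_g v^g
    return w, v

# Hypothetical toy problem: identity design, two overlapping groups, unit weights.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
groups, weights = [(0, 1), (1, 2)], [1.0, 1.0]
w_a, v_a = latent_group_lasso(X, [3.0, 0.0, 0.0], groups, weights, lam=0.5)
w_b, v_b = latent_group_lasso(X, [0.0, 3.0, 0.0], groups, weights, lam=0.5)
```

In the second toy problem the signal sits on the shared coordinate; the duplicated vector is then not unique, but the recombined $w$ is, which is the quantity of interest.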

Alternatively, efficient algorithms which do not require working in the space of duplicated covariates are possible. Such an algorithm was proposed by Mosci et al. (2010), who suggested using a proximal algorithm and computing the proximal operator of the norm via an approximate projection on the unit ball of the dual norm in the input space. To avoid duplication, it would also be possible to use an approach similar to that of Rakotomamonjy et al. (2008). Finally, one could also consider algorithms from the multiple kernel learning literature.

5. Group-support 

A natural question associated with the norm $\Omega$ is what sparsity patterns are elicited when the norm is used as a regularizer. This question is natural in the context of support recovery. If the groups are disjoint, one could equivalently ask which patterns of selected groups are possible, since answering the latter question is equivalent to answering the former. This suggests a view in which the support is expressed in terms of groups. We formalize this idea through the concept of group-support of a vector $w$, which, put informally, is the set of groups that are non-zero in a decomposition of $w$. We will see that this notion is useful to characterize induced decompositions and recovery properties of the norm.

5.1 Definitions 

More formally, we naturally call group-support of a decomposition $\bar{v} \in \mathcal{V}_{\mathcal{G}}$ the set of groups $g$ such that $v^g \neq 0$. We extend this definition to a vector as follows:

Definition 12 (Strong group-support) The strong group-support $\breve{\mathcal{G}}_1(w)$ of a vector $w \in \mathbb{R}^p$ is the union of the group-supports of all its optimal decompositions, namely:

$$\breve{\mathcal{G}}_1(w) = \{g \in \mathcal{G} \mid \exists \bar{v} \in \mathcal{V}(w) \text{ s.t. } v^g \neq 0\}.$$

If $w$ has a unique decomposition $\bar{v}(w)$, then $\breve{\mathcal{G}}_1(w) = \{g \in \mathcal{G} \mid v^g(w) \neq 0\}$ is the group-support of its decomposition. We also define a notion of weak group-support in terms of uniqueness of the optimal dual variables.

Definition 13 (Weak group-support) The weak group-support of a vector $w \in \mathbb{R}^p$ is

$$\mathcal{G}_1(w) = \{g \in \mathcal{G} \mid \exists \alpha_g \in \mathbb{R}^{|g|} \text{ s.t. } \Pi_g \mathcal{A}(w) = \{\alpha_g\} \text{ and } \|\alpha_g\| = d_g\}.$$

It follows immediately from Lemma 9 that $\breve{\mathcal{G}}_1(w) \subset \mathcal{G}_1(w)$. When $\breve{\mathcal{G}}_1(w) = \mathcal{G}_1(w)$, we refer to $\mathcal{G}_1(w)$ as the group-support of $w$; otherwise we say that the group-support is ambiguous.

The definitions of strong group-support and weak group-support are motivated by the fact that in the variational formulation (8), the strong group-support is the set of groups for which the constraints $d_g^{-1}\|\alpha_g\| \le 1$ are strongly active whereas the weak group-support is the set of weakly or strongly active such constraints (Nocedal and Wright, 2006, p. 342). We illustrate these two notions on a few examples in Section 6.

5.2 Supports induced by the group-support

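For any $w \in \mathbb{R}^p$, we denote by $J_1(w)$ (resp. $\breve{J}_1(w)$) the set of variables in groups of the weak group-support (resp. strong group-support):

$$J_1(w) = \bigcup_{g \in \mathcal{G}_1(w)} g \qquad \text{and} \qquad \breve{J}_1(w) = \bigcup_{g \in \breve{\mathcal{G}}_1(w)} g.$$

Since $\breve{\mathcal{G}}_1(w) \subset \mathcal{G}_1(w)$, it immediately follows that $\breve{J}_1(w) \subset J_1(w)$ (see footnote 2). The following two lemmas show that, on $J_1(w)$, any dual variables $\alpha \in \mathcal{A}(w)$ are uniquely determined.

Lemma 14 If $J_1(w) \setminus \breve{J}_1(w) \neq \emptyset$, then for any $\alpha \in \mathcal{A}(w)$, $\alpha_{J_1(w) \setminus \breve{J}_1(w)} = 0$.

Proof Note that $w_{J_1(w) \setminus \breve{J}_1(w)} = 0$ since $v^g = 0$ for $g \in \mathcal{G}_1(w) \setminus \breve{\mathcal{G}}_1(w)$. Let $g \in \mathcal{G}_1(w) \setminus \breve{\mathcal{G}}_1(w)$. If $g \setminus \breve{J}_1(w) \neq \emptyset$ and if $\Pi_{g \setminus \breve{J}_1(w)} \mathcal{A}(w) \neq \{0\}$, then let $i \in g \setminus \breve{J}_1(w)$ be such that there exists $\alpha \in \mathcal{A}(w)$ with $\alpha_i \neq 0$. Setting $\alpha_i = 0$ leads to another vector that solves the second variational formulation (7) and such that $\|\alpha_g\| < d_g$, which contradicts the hypothesis that $g \in \mathcal{G}_1(w)$. ■

Lemma 15 For any $w \in \mathbb{R}^p$, $\Pi_{J_1(w)} \mathcal{A}(w)$ is a singleton, i.e., there exists $\alpha_{J_1(w)}(w) \in \mathbb{R}^{|J_1(w)|}$ such that, for all $\alpha' \in \mathcal{A}(w)$, $\alpha'_{J_1(w)} = \alpha_{J_1(w)}(w)$.

Proof By definition of $\breve{J}_1(w)$, for all $i \in \breve{J}_1(w)$ there exists at least one $\bar{v} \in \mathcal{V}(w)$ and one group $g \ni i$ such that $v^g \neq 0$. Now as a consequence of Lemma 9, for any two solutions $\alpha, \alpha' \in \mathcal{A}(w)$, we have $\alpha_g = \alpha'_g = d_g \, v^g / \|v^g\|$, so in particular $\alpha_i = \alpha'_i$. For $i \in J_1(w) \setminus \breve{J}_1(w)$, Lemma 14 shows that $\alpha_i = 0$. ■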



6. Illustrative examples 

In this section, we consider a few examples that illustrate some of the properties of $\Omega$, namely situations where weak and strong group-support differ, or where there is an entire set of optimal decompositions. We will abuse notation and write $v^g$ for $v^g_g$ when writing explicit decompositions. We will denote by Sign the correspondence (or set-valued function) defined by $\mathrm{Sign}(x) = \{1\}$ if $x > 0$, $\mathrm{Sign}(x) = \{-1\}$ if $x < 0$ and $\mathrm{Sign}(0) = [-1, 1]$.

6.1 Two overlapping groups 

We first consider the case $p = 3$ and $\mathcal{G} = \{\{1, 2\}, \{2, 3\}\}$.

Lemma 16 We have $\Omega(w) = \|(|w_2|, |w_1| + |w_3|)\|$. If $(w_1, w_3) \neq 0$, the optimal decomposition is unique with

$$v^{\{1,2\}} = \Big(w_1,\ \frac{|w_1|}{|w_1| + |w_3|} w_2\Big)^\top \qquad \text{and} \qquad v^{\{2,3\}} = \Big(\frac{|w_3|}{|w_1| + |w_3|} w_2,\ w_3\Big)^\top, \qquad (23)$$

$$\mathcal{A}(w) = \big\{ \big((|w_1| + |w_3|)\gamma_1,\ w_2,\ (|w_1| + |w_3|)\gamma_3\big)/\Omega(w) \ \big|\ \gamma_i \in \mathrm{Sign}(w_i),\ i \in \{1, 3\} \big\},$$

$\breve{J}_1 = J_1 = \mathrm{supp}(w)$ and $\breve{\mathcal{G}}_1 = \mathcal{G}_1$ includes $\{1, 2\}$ if $w_1 \neq 0$ and $\{2, 3\}$ if $w_3 \neq 0$.

If $(w_1, w_3) = 0$, then $v^{\{1,2\}} = (0, \gamma w_2)^\top$ and $v^{\{2,3\}} = ((1 - \gamma) w_2, 0)^\top$ is an optimal decomposition for any $\gamma \in [0, 1]$, $\mathcal{A}(w) = \{(0, \mathrm{sign}(w_2), 0)\}$, $\breve{J}_1 = J_1 = \{1, 2, 3\}$ and $\breve{\mathcal{G}}_1 = \mathcal{G}_1 = \mathcal{G}$.

We prove this lemma in Section C.1.1 (as a special case of the "cycle of length three" which we consider next). Here, the case where the decomposition is not unique seems to be a relatively pathological case where the true support is included in the intersection of two groups. However, note that the weak group-support and strong group-support coincide, even in the latter case.

Footnote 2. It is possible to have $J_1(w) \neq \breve{J}_1(w)$: consider $\mathcal{G} = \{\{1, 2\}, \{1, 3\}, \{2, 3, 4\}\}$ and $w$ proportional to $(1, \mu, 1 - \mu, 0)$ for any $\mu \in (0, 1)$. We then have $\breve{\mathcal{G}}_1 = \{\{1, 2\}, \{1, 3\}\}$ and $\mathcal{G}_1 = \mathcal{G}$, so that $\breve{J}_1 = \{1, 2, 3\} \neq J_1 = \{1, 2, 3, 4\}$.
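
The closed forms stated for two overlapping groups can be verified numerically. The sketch below assumes unit weights $d_g = 1$, the closed form $\Omega(w) = \|(|w_2|, |w_1| + |w_3|)\|$, the decomposition (23) and the dual variable with entries $(|w_1|+|w_3|)\,\mathrm{sign}(w_1)/\Omega(w)$, $w_2/\Omega(w)$, $(|w_1|+|w_3|)\,\mathrm{sign}(w_3)/\Omega(w)$; the function names and the test vector are hypothetical.

```python
import math

def omega_two_groups(w):
    """Closed form for groups {1,2} and {2,3} with unit weights (assumed)."""
    w1, w2, w3 = w
    return math.hypot(w2, abs(w1) + abs(w3))

def decompose_two_groups(w):
    """Decomposition (23); defined only when (w1, w3) != 0."""
    w1, w2, w3 = w
    s = abs(w1) + abs(w3)
    if s == 0.0:
        raise ValueError("the optimal decomposition is not unique when w1 = w3 = 0")
    v12 = [w1, abs(w1) / s * w2, 0.0]
    v23 = [0.0, abs(w3) / s * w2, w3]
    return v12, v23

def dual_variable(w):
    """Dual variable of the lemma; unique when w1 and w3 are both non-zero."""
    w1, w2, w3 = w
    s = abs(w1) + abs(w3)
    om = omega_two_groups(w)
    return [s * math.copysign(1.0, w1) / om, w2 / om, s * math.copysign(1.0, w3) / om]

def norm(v):
    return math.sqrt(sum(t * t for t in v))

w = [3.0, 4.0, -1.0]
v12, v23 = decompose_two_groups(w)
recombined = [a + b for a, b in zip(v12, v23)]        # should equal w
cost = norm(v12) + norm(v23)                          # should equal Omega(w)
alpha = dual_variable(w)
duality = sum(a * t for a, t in zip(alpha, w))        # alpha^T w = Omega(w)
```

Both group restrictions of `alpha` have unit norm here, consistent with both groups belonging to the group-support when $w_1$ and $w_3$ are non-zero.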



6.2 Cycle of length 3 

We now turn to the case $p = 3$ and $\mathcal{G} = \{\{1, 2\}, \{2, 3\}, \{1, 3\}\}$. Note that if at least one of the groups is not part of the weak group-support, we fall back on the case of two overlapping groups. We therefore have the following lemma:

Lemma 17 Define $W_{bal} = \{w \in \mathbb{R}^3 \mid |w_i| \le \|w_{\{i\}^c}\|_1,\ i \in \{1, 2, 3\}\}$. We have

$$\Omega(w) = \begin{cases} \frac{1}{\sqrt{2}} \|w\|_1 & \text{if } w \in W_{bal}, \\ \min_{i \in \{1, 2, 3\}} \big\|\big(w_i, \|w_{\{i\}^c}\|_1\big)\big\| & \text{else.} \end{cases}$$

If $|\mathrm{supp}(w)| \neq 1$ the optimal decomposition is unique. If in addition $w \in W_{bal}$, we have for $(i, j, k) \in \{(1, 2, 3), (2, 3, 1), (3, 1, 2)\}$:

$$v^{\{i,j\}} = \frac{1}{2}\big(|w_i| + |w_j| - |w_k|\big) \begin{pmatrix} \mathrm{sign}(w_i) \\ \mathrm{sign}(w_j) \end{pmatrix} \qquad \text{and} \qquad \mathcal{A}(w) = \Big\{ \frac{1}{\sqrt{2}} \big(\mathrm{sign}(w_1), \mathrm{sign}(w_2), \mathrm{sign}(w_3)\big) \Big\}.$$

Moreover, we have $\breve{J}_1 = J_1 = \{1, 2, 3\}$, $\mathcal{G}_1 = \mathcal{G}$ and, for $w$ in the interior of $W_{bal}$, $\breve{\mathcal{G}}_1 = \mathcal{G}_1 = \mathcal{G}$.

We prove this lemma in Appendix C.1, and illustrate it in Figure 3 with the unit ball of the obtained norm. In this case it is interesting to note that the group-support (weak or strong) is not necessarily a minimal cover, where we say that a set of groups provides a minimal cover if it is impossible to remove a group while still covering the support. For instance, for $w$ in the interior of $W_{bal}$, the group-support contains all three groups, while the support is covered by any two groups. This is clearly a consequence of the convexity of the formulation. The cycle of length 3 is also interesting because, for any $w$ on the boundary of $W_{bal}$, the weak and strong group-support do not coincide, as illustrated in Figure 3 (right). Indeed if for example $|w_3| = |w_1| + |w_2|$, then $v^{\{1,2\}} = (0, 0)^\top$, $v^{\{1,3\}} = |w_1| (\mathrm{sign}(w_1), \mathrm{sign}(w_3))^\top$ and $v^{\{2,3\}} = |w_2| (\mathrm{sign}(w_2), \mathrm{sign}(w_3))^\top$, so that by Lemma 9 the dual variable satisfies $\|\alpha_{\{1,2\}}\| = 1$, which means that $\{1, 2\}$ is in the weak but not in the strong group-support.

Figure 3: (Left) The unit ball of $\Omega$ for the groups $\{1, 2\}, \{1, 3\}, \{2, 3\}$ in $\mathbb{R}^3$. (Right) A diagram that represents the restriction of the unit ball to the positive orthant. The black lines separate the surface into four regions. The triangular central region is $W_{bal}$. On the interior of each region and on the colored outer boundaries, the group-support is constant, non-ambiguous (i.e., the weak and strong group-supports coincide) and represented by colored bullets or the color of the edge, with one color associated with each group. On the boundary of $W_{bal}$, the black lines indicate that the group-support is ambiguous, the weak group-support containing all three groups, and the strong group-support being equal to that of the outer adjacent region for each black segment.

6.3 Cycle of length 4

We consider the case p = 4 and show the following result in appendix C.2. 

Lemma 18 For $\mathcal{G} = \{\{1, 2\}, \{1, 3\}, \{2, 4\}, \{3, 4\}\}$, $\Omega$ has the closed form

$$\Omega(w) = \big\|\big(|w_1| + |w_4|,\ |w_2| + |w_3|\big)\big\|_2.$$

However, if $|\mathrm{supp}(w)| = 4$, the optimal decomposition is never unique.

This suggests that for a general $\mathcal{G}$, unique solutions are the exception rather than the rule. This motivates a posteriori definitions of group-support that are meaningful in the case where the decomposition is not unique. We consider a necessary and sufficient condition for uniqueness in Lemma 48.
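
The closed forms for the two cycles can be checked on small examples. The sketch below assumes unit weights and the following formulas: for the cycle of length 3, $\Omega(w) = \|w\|_1/\sqrt{2}$ on $W_{bal}$ and $\min_i \|(w_i, \|w_{\{i\}^c}\|_1)\|$ otherwise, with latent vectors $\frac{1}{2}(|w_i| + |w_j| - |w_k|)(\mathrm{sign}(w_i), \mathrm{sign}(w_j))$ on $W_{bal}$; for the cycle of length 4, $\Omega(w) = \|(|w_1| + |w_4|, |w_2| + |w_3|)\|_2$. Function names are hypothetical and indices are 0-based in the code.

```python
import math

def omega_cycle3(w):
    """Assumed closed form for groups {1,2}, {2,3}, {1,3} with unit weights."""
    a = [abs(t) for t in w]
    l1 = sum(a)
    if all(2.0 * ai <= l1 for ai in a):          # w in W_bal
        return l1 / math.sqrt(2.0)
    return min(math.hypot(ai, l1 - ai) for ai in a)

def decomposition_cycle3(w):
    """Assumed decomposition for w in W_bal; keys are 0-based groups (i, j)."""
    a = [abs(t) for t in w]
    sgn = [math.copysign(1.0, t) for t in w]
    dec = {}
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        c = 0.5 * (a[i] + a[j] - a[k])
        v = [0.0, 0.0, 0.0]
        v[i], v[j] = c * sgn[i], c * sgn[j]
        dec[(i, j)] = v
    return dec

def omega_cycle4(w):
    """Assumed closed form for groups {1,2}, {1,3}, {2,4}, {3,4}."""
    return math.hypot(abs(w[0]) + abs(w[3]), abs(w[1]) + abs(w[2]))

def cost(vectors):
    """Sum of Euclidean norms of the latent vectors (unit weights)."""
    return sum(math.sqrt(sum(t * t for t in v)) for v in vectors)

# Cycle of length 3: a balanced vector and its decomposition.
w3 = [2.0, -2.0, 3.0]
dec3 = decomposition_cycle3(w3)
sum3 = [sum(col) for col in zip(*dec3.values())]
cost3 = cost(dec3.values())

# Cycle of length 4: two distinct decompositions of the same w with equal cost.
w4 = [1.0, 1.0, 1.0, 1.0]
dec4_a = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]   # groups {1,2} and {3,4}
dec4_b = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]   # groups {1,3} and {2,4}
```

For the cycle of length 4, both decompositions attain the closed-form value, which gives a concrete instance of non-uniqueness when the support is full.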

7. Model selection consistency 

In this section we consider the estimator $\hat{w}$ obtained as a solution of the learning problem (13) in the context of a well-specified model. Specifically, we consider the linear regression model:

$$y = Xw^* + \varepsilon, \qquad (24)$$

where $X \in \mathbb{R}^{n \times p}$ is a design matrix, $y \in \mathbb{R}^n$ is the response vector and $\varepsilon \in \mathbb{R}^n$ is a vector of i.i.d. random variables with mean 0 and finite variance. We denote by $w^*$ the true regression function, and by $\hat{w}$ the one we estimate as the solution of the following optimization problem, which is a particular case of (13):

$$\min_{w \in \mathbb{R}^p} \frac{1}{2n} \|y - Xw\|^2 + \lambda_n \Omega(w). \qquad (25)$$

Several types of consistency results are of interest when using a sparsity-inducing norm as a regularizer. One typically distinguishes classical consistency, where $\|\hat{w} - w^*\|$ converges in probability to zero, prediction consistency, where $|L(\hat{w}) - L(w^*)|$ converges to zero in probability, and model selection consistency or support recovery, where the support of $\hat{w}$ coincides with the support of $w^*$ with high probability. We are interested in the discussion of the last type of result, support recovery, for solutions of (25).

As compared with the Lasso and the group Lasso in the case of disjoint supports, the discussion of support recovery is complicated by several factors here. First, supports that can be recovered are not exactly the ones that can be expressed as unions of groups in $\mathcal{G}$; as the reader might expect, the appropriate notion of support is $J_1(w^*)$ (or $\breve{J}_1(w^*)$), the one induced by the concept of group-support introduced in Section 5. Second, by contrast with the situation of the group Lasso with disjoint groups, the identification of the support $J_1(w^*)$ (or $\breve{J}_1(w^*)$) is not equivalent to the identification of the group-support $\mathcal{G}_1(w^*)$ (or $\breve{\mathcal{G}}_1(w^*)$), the latter being now a harder problem. As a consequence one should distinguish support recovery from group-support recovery, and, depending on the context, the appropriate notion to consider for model selection consistency might be one or the other. Third, the group-support is characterized by properties of the set $\mathcal{V}(w)$ whose convergence is less trivial to study than that of a vector. For these reasons, we consider only in this paper the classical asymptotic regime in which the model generating the data is of fixed finite dimension $p$ while $n \to \infty$. However we focus on the harder problem of group-support recovery, which will then imply support recovery results.

The proof of consistency we present below follows a classical proof scheme (Bach, 2008a). However the originality of our work resides in that we characterize the group-support consistency here, which requires in particular studying the convergence of the set-valued map $\mathcal{V}(w)$. We therefore start in the next section by introducing appropriate notions of continuity for set-valued functions.

7.1 Correspondence theory to the rescue 

We appeal to the theory of correspondences developed by Claude Berge at the end of the 1950s (Berge, 1959). In particular, we follow closely its presentation by Border (1985).

Definition 19 (correspondence) A correspondence $\varphi$ from a set $X$ to a set $Y$, denoted $\varphi : X \twoheadrightarrow Y$, is a set-valued mapping which to each element $x \in X$ associates a set $\varphi(x) \subset Y$.

When $X$ and $Y$ are metric spaces, the usual notion of continuity of a function is replaced for correspondences by the following notions:

Definition 20 (hemicontinuity and continuity) Given two metric spaces $(X, d)$ and $(Y, \rho)$, a correspondence $\varphi : X \twoheadrightarrow Y$ is said to be upper hemicontinuous or u.h.c. (resp. lower hemicontinuous or l.h.c.) if for any point $x \in X$ and any open set $U \subset Y$ such that $\varphi(x) \subset U$ (resp. $\varphi(x) \cap U \neq \emptyset$) there exists a neighborhood $V$ of $x$ such that, for all $x' \in V$, $\varphi(x') \subset U$ (resp. $\varphi(x') \cap U \neq \emptyset$). A correspondence is said to be continuous if it is both upper and lower hemicontinuous.

Note that a singleton-valued correspondence $\varphi$ can be identified with the function $f$ taking this unique value, and that $f$ is continuous if and only if $\varphi$ is lower or upper hemicontinuous, both notions being equivalent in that case. The following results, which we prove in Appendix A, are key to studying the consistency of our method in the next section.

Lemma 21 $w \mapsto \mathcal{A}(w)$ is an upper hemicontinuous correspondence.

Lemma 22 If $\mathrm{supp}(w) = J_1$, then, on the domain $\mathcal{D} = \{u \in \mathbb{R}^p \mid \mathrm{supp}(u) \subset J_1\}$, $u \mapsto \mathcal{V}(w + u)$ is a continuous correspondence at $u = 0$.

7.2 Group-support recovery 

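In this section, we state and prove our main consistency results for group-support and support recovery in the least-squares linear regression framework (24). We consider two main hypotheses:

$$\text{(H1)} \quad \Sigma = \frac{1}{n} X^\top X \succ 0, \qquad \text{(H2)} \quad \mathrm{supp}(w^*) = J_1(w^*).$$

We denote $\mathcal{G}_2(w^*) = \mathcal{G} \setminus \mathcal{G}_1(w^*)$ and $J_2(w^*) = [1, p] \setminus J_1(w^*)$. For convenience, for any group of covariates $g$ we denote by $X_g$ the $n \times |g|$ design matrix restricted to the covariates in $g$, and for any two groups $g, g'$ we write $\Sigma_{gg'} = \frac{1}{n} X_g^\top X_{g'}$.

Consider the following two conditions, where we denote $J_1(w^*)$ simply by $J_1$ for the sake of clarity:

$$\forall g \in \mathcal{G}_2(w^*), \quad \big\|\Sigma_{g J_1} \Sigma_{J_1 J_1}^{-1} \alpha_{J_1}(w^*)\big\| \le d_g, \qquad \text{(C1)}$$

$$\forall g \in \mathcal{G}_2(w^*), \quad \big\|\Sigma_{g J_1} \Sigma_{J_1 J_1}^{-1} \alpha_{J_1}(w^*)\big\| < d_g. \qquad \text{(C2)}$$

Theorem 23 Under assumption (H1), for $\lambda_n \to 0$ and $\lambda_n n^{1/2} \to \infty$, conditions (C1) and (C2) are respectively necessary and sufficient for the strong group-support of the solution of (13), $\breve{\mathcal{G}}_1(\hat{w})$, to satisfy with probability tending to 1 as $n \to +\infty$:

$$\breve{\mathcal{G}}_1(\hat{w}) \subset \mathcal{G}_1(w^*).$$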



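Proof We follow the line of proof of Bach (2008a) but consider a fixed design for simplicity of notation. Let us first consider the subproblem of estimating a vector only on the support of $w^*$ by using only the groups in $\mathcal{G}_1(w^*)$ in the penalty, i.e., consider $\hat{w}_1 \in \mathbb{R}^{J_1}$ a solution of

$$\min_{w_{J_1} \in \mathbb{R}^{J_1}} \frac{1}{2n} \|y - X_{J_1} w_{J_1}\|^2 + \lambda_n \Omega_{J_1}(w_{J_1}).$$

By standard arguments, we can prove that $\hat{w}_1$ converges in Euclidean norm to $w^*$ restricted to $J_1$ as $n$ tends to infinity (Knight and Fu, 2000). In the rest of the proof we show how to construct a vector $\hat{w} \in \mathbb{R}^p$ from $\hat{w}_1$ which under condition (C2) is with high probability a solution to (25). By adding null components to $\hat{w}_1$, we obtain a vector $\hat{w} \in \mathbb{R}^p$ whose support is also included in $J_1$, and $u = \hat{w} - w^*$ therefore satisfies $\mathrm{supp}(u) \subset J_1$. A direct computation of the gradient of the loss $L(w) = \frac{1}{2n} \|y - Xw\|^2$ gives $\nabla L(\hat{w}) = \Sigma u - q$, where $q = \frac{1}{n} X^\top \varepsilon$. From this we deduce that $u_{J_1} = \Sigma_{J_1 J_1}^{-1} \big(\nabla_{J_1} L(\hat{w}) + q_{J_1}\big)$, and since, by Lemma 11, $-\nabla_{J_1} L(\hat{w}) \in \lambda_n \Pi_{J_1} \mathcal{A}(\hat{w})$, there exists $\alpha_{J_1} \in \Pi_{J_1} \mathcal{A}(\hat{w})$ such that we have

$$\nabla_{J_2} L(\hat{w}) = \Sigma_{J_2 J_1} u_{J_1} - q_{J_2} = \Sigma_{J_2 J_1} \Sigma_{J_1 J_1}^{-1} \big(-\lambda_n \alpha_{J_1} + q_{J_1}\big) - q_{J_2}.$$

To show that $\hat{w}$ is a feasible solution to (25) it is enough to show that $\forall g \in \mathcal{G}_2(w^*)$, $\|\nabla_g L(\hat{w})\| \le \lambda_n d_g$. But since the noise has bounded variance,

$$\Sigma_{J_2 J_1} \Sigma_{J_1 J_1}^{-1} q_{J_1} - q_{J_2} = -\frac{1}{n} X_{J_2}^\top \Big(I - \frac{1}{n} X_{J_1} \Sigma_{J_1 J_1}^{-1} X_{J_1}^\top\Big) \varepsilon$$

is $\sqrt{n}$-consistent, i.e., of order $O_p(n^{-1/2})$, and by the union bound we get $\mathbb{P}\big(\forall g \in \mathcal{G}_2(w^*),\ \|\nabla_g L(\hat{w})\| \le \lambda_n d_g\big) \ge 1 - \sum_{g \in \mathcal{G}_2(w^*)} \mathbb{P}\big(\|\nabla_g L(\hat{w})\| > \lambda_n d_g\big)$. We therefore deduce that, for any $g \in \mathcal{G}_2(w^*)$,

$$\frac{1}{\lambda_n} \|\nabla_g L(\hat{w})\| \le \big\|\Sigma_{g J_1} \Sigma_{J_1 J_1}^{-1} \alpha_{J_1}\big\| + O_p\big(\lambda_n^{-1} n^{-1/2}\big).$$

By Lemma 21, we have that $\Pi_{J_1} \mathcal{A}(w)$ is an upper hemicontinuous correspondence, so that, since $\Pi_{J_1} \mathcal{A}(w^*)$ is the singleton $\{\alpha_{J_1}(w^*)\}$ by Lemma 15, $\hat{w} \xrightarrow{p} w^*$ implies that

$$\max_{\alpha \in \mathcal{A}(\hat{w})} \big\|\alpha_{J_1} - \alpha_{J_1}(w^*)\big\| \xrightarrow{p} 0.$$

Since we chose $\lambda_n$ such that $\lambda_n^{-1} n^{-1/2} \to 0$, we have

$$\frac{1}{\lambda_n} \|\nabla_g L(\hat{w})\| \le \big\|\Sigma_{g J_1} \Sigma_{J_1 J_1}^{-1} \alpha_{J_1}(w^*)\big\| + o_p(1).$$

This shows that, under (C2), $\hat{w}$ is a feasible solution to (25) whose group-support is contained in $\mathcal{G}_1(w^*)$, i.e., we have shown $\breve{\mathcal{G}}_1(\hat{w}) \subset \mathcal{G}_1(w^*)$.

For the necessary condition, by contradiction, consider a solution supported on $J_1$. Then, reusing the previous argument we have

$$\frac{1}{\lambda_n} \|\nabla_g L(\hat{w})\| \ge \big\|\Sigma_{g J_1} \Sigma_{J_1 J_1}^{-1} \alpha_{J_1}(w^*)\big\| - o_p(1),$$

which shows that for the optimality conditions of Lemma 11(b) to hold, condition (C1) is necessary. ■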

The previous theorem shows a partial consistency result in the sense that it guarantees that no group outside of the group-support will be selected. Since $\hat{w}$ also converges in probability in Euclidean norm to $w^*$, this implies for the support that with high probability

$$\mathrm{supp}(w^*) \subset \mathrm{supp}(\hat{w}) \subset J_1(w^*).$$

However, the theorem does not guarantee that all groups in $\breve{\mathcal{G}}_1(w^*)$ will be selected. This is not a shortcoming of the theorem: we provide an example in Appendix B which shows that it is possible that $\breve{\mathcal{G}}_1(\hat{w}) \subsetneq \breve{\mathcal{G}}_1(w^*)$ with probability 1. Nonetheless, we also show in the same appendix that with high probability there exists $\bar{v}^* \in \mathcal{V}(w^*)$ whose group-support is included in $\breve{\mathcal{G}}_1(\hat{w})$.

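Theorem 24 With assumptions (H1, H2) and for $\lambda_n \to 0$ and $\lambda_n n^{1/2} \to \infty$, condition (C2) is sufficient for the strong group-support of the solution of (25), $\breve{\mathcal{G}}_1(\hat{w})$, to satisfy with high probability:

$$\breve{\mathcal{G}}_1(w^*) \subset \breve{\mathcal{G}}_1(\hat{w}) \subset \mathcal{G}_1(w^*).$$

Proof The previous theorem shows that (C2) implies, with high probability, $\breve{\mathcal{G}}_1(\hat{w}) \subset \mathcal{G}_1(w^*)$. Moreover, by Lemma 22, hypothesis (H2) guarantees that $w \mapsto \mathcal{V}(w)$ is continuous at $w^*$ for $w$ with $\mathrm{supp}(w) \subset J_1(w^*)$. Combined with the fact that $\hat{w}$ converges in probability to $w^*$, this implies that $\forall \epsilon > 0$, $\exists n_0$, $\forall n > n_0$, with probability larger than $1 - \epsilon$, $\forall \bar{v}^* \in \mathcal{V}(w^*)$, there exists $\bar{v} \in \mathcal{V}(\hat{w})$ such that $\|\bar{v} - \bar{v}^*\| \le \epsilon$. For each $g \in \breve{\mathcal{G}}_1(w^*)$, for $\bar{v}^* \in \mathcal{V}(w^*)$ such that $v^{*g} \neq 0$, there thus exists $\epsilon > 0$ such that the previous convergence result implies that $g \in \breve{\mathcal{G}}_1(\hat{w})$ with high probability. Finally, since $|\breve{\mathcal{G}}_1(w^*)|$ is finite, for $n$ large enough, the union bound ensures that, with high probability, $\breve{\mathcal{G}}_1(w^*) \subset \breve{\mathcal{G}}_1(\hat{w})$. ■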



The previous theorem shows the best result possible for the situation where $\breve{\mathcal{G}}_1(w^*) \neq \mathcal{G}_1(w^*)$, as, in the example of the cycle of length 3 of Section 6.2, the case of $w^* = (2, 1, 1)$. If $\breve{\mathcal{G}}_1(w^*) = \mathcal{G}_1(w^*)$, then we have the obvious corollary:

Corollary 25 With assumptions (H1, H2), and assuming $\breve{\mathcal{G}}_1(w^*) = \mathcal{G}_1(w^*)$, for $\lambda_n \to 0$ and $\lambda_n n^{1/2} \to \infty$, conditions (C1) and (C2) are respectively necessary and sufficient for the solution of (13) to estimate consistently the correct group-support $\mathcal{G}_1(w^*)$.



Remarks: For the Lasso and the usual group Lasso with disjoint groups, the most favorable case w.r.t. condition (C2) is the case where the empirical covariance of the design is the identity (the same analysis can be done in the random design case), i.e., the case where there are no correlations between groups. In that case, we have $\Sigma_{g J_1} \Sigma_{J_1 J_1}^{-1} = 0$ and the mutual incoherence is 0. However, in the case of overlap, for $g \in \mathcal{G}$ such that $g \cap J_1 \neq \emptyset$, then $\Sigma_{g J_1} \Sigma_{J_1 J_1}^{-1} \neq 0$ and we have $\big\|\Sigma_{g J_1} \Sigma_{J_1 J_1}^{-1} \alpha_{J_1}\big\| = \|\alpha_{g \cap J_1}\|$. First, this gives yet another motivation to consider the weak group-support, since those groups in the weak group-support are exactly the ones for which $\|\alpha_{g \cap J_1}\| = 1$ (see Lemma 14). Second, this shows that if $g_1 \in \mathcal{G}_1(w^*)$ and $g_2 \notin \mathcal{G}_1(w^*)$ have a large overlap then $\|\alpha_{g_2 \cap J_1}\|$ can be fairly close to 1 even for a design with identity covariance. This means that it might be very difficult in practice to identify $g_2$ correctly as being outside of the support unless large amounts of data are available.
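
The second remark can be made concrete on the two-group example of Section 6.1. The sketch below assumes unit weights, an identity empirical covariance, groups $g_1 = \{1, 2\}$ and $g_2 = \{2, 3\}$, and a true vector $w^* = (w_1, w_2, 0)$ with $w_1 \neq 0$. Under these assumptions the closed form of Section 6.1 gives $J_1 = \{1, 2\}$ and $\alpha_2 = w_2 / \|(w_1, w_2)\|$, so the left-hand side of (C2) for $g_2$ reduces to $|w_2| / \|(w_1, w_2)\|$. The function name is hypothetical.

```python
import math

def incoherence_identity_design(w1, w2):
    """||alpha_{g2 ∩ J1}|| for w* = (w1, w2, 0), groups {1,2} and {2,3},
    unit weights and identity covariance (all assumptions of this sketch)."""
    if w1 == 0.0:
        raise ValueError("this sketch assumes w1 != 0, so that g2 lies outside the group-support")
    return abs(w2) / math.hypot(w1, w2)

balanced = incoherence_identity_design(1.0, 1.0)      # about 0.707: comfortable margin
dominated = incoherence_identity_design(0.1, 1.0)     # about 0.995: (C2) barely holds
```

The quantity stays strictly below 1, so (C2) holds, but the margin shrinks as the shared coordinate dominates, which matches the difficulty described in the remark.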



7.3 Related theoretical results 

Two papers recently proposed theoretical results on the estimator obtained via regularization
by $\Omega$ in the high-dimensional setting. Percival (2011) shows two types of results. First,
he proposes a generalization of the restricted eigenvalue condition of Bickel et al. (2009) and
generalizes their proof to obtain fast-rate concentration results for the prediction
error and convergence in $\ell_2$-norm. The bounds obtained scale as $\sqrt{B \log(M)}$, where
$M$ is the total number of groups and $B$ is the largest group size. Then he considers an adaptive
version of the regularization (in the sense of the adaptive Lasso) and shows for the resulting
estimator a central limit theorem under high-dimensional scaling, under the conditions
that the support is exactly a union of groups and that the decomposition of any point
in a neighborhood of the optimum is unique. These results do not focus on support or
group-support recovery. Also, it was one of our concerns to relax the assumption that the
decomposition is unique or that the support is exactly a union of groups.

Maurer and Pontil (2011) give a bound on the Rademacher complexity of linear functions
whose parameter vector lies in the unit ball of the norm $\Omega_\cup^{\mathcal{G}}$, hence
bounding the generalization error of such functions. They consider as well extensions of this
norm where each of the latent variables in the latent group Lasso is penalized by the norm of
its image by some operator.

Our paper and these two papers have thus considered complementary aspects of estimation
and recovery in statistical and compressed sensing settings based on $\Omega_\cup^{\mathcal{G}}$,
which should all contribute to understanding the high-dimensional learning setting.



8. Choice of the weights 

The choice of the weights $d_g$ associated to each group has been discussed in the literature
on the classical group Lasso, when groups do not overlap. The main motivation for the
introduction of these weights is to take into account the discrepancies in size between
different groups. Yuan and Lin (2006) used $d_g = \sqrt{|g|}$, which yields solutions similar to
the ANOVA test under a certain design. Bach et al. (2004), in the context of multiple kernel
learning, used $d_g \propto \sqrt{\operatorname{tr} K_g}$, where $\{K_g\}_{g \in \mathcal{G}}$ are
positive definite kernels, with $K_g = X_g X_g^\top$ in our context; for normalized features
such as $X X^\top = I$, this yields $d_g = \sqrt{|g|}$ as well.

In the context of our latent group Lasso with overlapping groups, the choice of the
weights is significantly more important than in the case of disjoint groups, and, arguably,
than in the case of other formulations considering overlapping groups: indeed, the notions
of group-support $\mathcal{G}_1(w)$ and of support $J_1(w)$ associated to a vector $w$ through
the norm $\Omega(w)$ (in both their strong and weak versions) themselves change according to
the choice of the weights.

In this section we propose two types of arguments to study the effect of and guide the 
choice of weights: 

• On the one hand we consider a vector w and ask, independently of a learning problem, 
which groups participate in its group support: there is no point in introducing a group
in $\mathcal{G}$ if the weights are such that it can never be included in the group support. We
show in Section 8.1 that, for all groups to be useful, weights should increase with the 
size of the groups, but not too quickly; in Section 8.2 we attempt to characterize when 
large groups are preferred over unions of smaller ones. 

• On the other hand, we consider in Section 8.3 a simple regression scenario, and discuss 
the impact of the weights on the probability to correctly identify relevant groups, and 
simultaneously control the rate of false positives. 

8.1 Redundant groups 

Informally, we are concerned in this section with the fact that, if a group $g$ contains a
group $h$ and $d_g / d_h$ is too small, $h$ will never enter the group support, and, conversely,
if $g$ is covered by a certain number of groups and $d_g$ is too large, then $g$ will never
enter the group-support.

Formally, we say that a group $g \in \mathcal{G}$ is redundant for a certain set of weights
$(d_g)_{g \in \mathcal{G}}$ if it can be removed without changing the value of the norm $\Omega$
for any $w$; this is equivalent to asking that the dual norm $\Omega^*$ is unchanged.
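The redundancy notion can be probed numerically. The following is a minimal sketch, assuming the dual norm takes the form $\Omega^*(\alpha) = \max_g \|\alpha_g\| / d_g$ (the gauge of an intersection of cylinders $\{\alpha \mid \|\alpha_g\| \leq d_g\}$); the toy groups, weights and the helper name `dual_norm` are hypothetical and chosen only for illustration.

```python
import math
import random

def dual_norm(alpha, groups, weights):
    """Assumed dual norm: max over groups of ||alpha_g|| / d_g."""
    return max(
        math.sqrt(sum(alpha[i] ** 2 for i in g)) / d
        for g, d in zip(groups, weights)
    )

# Hypothetical toy collection: g = {0, 1} is contained in g' = {0, 1, 2}
# and carries the larger weight, so g is expected to be redundant.
groups = [(0, 1), (0, 1, 2), (3,)]
weights = [2.0, 1.5, 1.0]

random.seed(0)
samples = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(1000)]

# Dropping g (index 0) should leave the dual norm unchanged on every sample.
unchanged = all(
    dual_norm(a, groups, weights) == dual_norm(a, groups[1:], weights[1:])
    for a in samples
)
print(unchanged)
```

A random check of this kind cannot prove redundancy, but a single sample on which the two values differ would show that the group is not redundant.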

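We first show that if there exists another group $g' \in \mathcal{G}$ such that $g \subset g'$,
$g$ is redundant unless we require that $d_g < d_{g'}$:

Lemma 26 If $g, g' \in \mathcal{G}$ satisfy $g \subset g'$ and $d_g \geq d_{g'}$, then for any $w$,
$(g \in \mathcal{G}_1(w)) \Rightarrow (g' \in \mathcal{G}_1(w))$.

Proof If $d_g \geq d_{g'}$, and if $g \in \mathcal{G}_1(w)$ then
$1 = \frac{\|\alpha_g(w)\|}{d_g} \leq \frac{\|\alpha_{g'}(w)\|}{d_{g'}}$, which implies
$g' \in \mathcal{G}_1(w)$.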

■ 

It would be very natural to try and require that the weights are chosen so that, if
$g = \operatorname{supp}(w)$, its group-support is exactly $g$. Unfortunately, this is in general
not possible: we show a negative result, which arises as a consequence of the previous lemma.

Lemma 27 For some group sets $\mathcal{G}$, it is impossible to choose the weights $d_g$
independently of $w$ so that $J_1(w) = \operatorname{supp}(w)$ (for either the strong or the weak
notion of support) if the latter is a union of groups.

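Proof Consider the groups $A = \{1, 2, 3\}$, $B = \{3, 4\}$, $C = \{2, 3, 4\}$:

• To have that $J_1(\mathbf{w}) = \operatorname{supp}(\mathbf{w})$ for all $\mathbf{w}$, Lemma 26
imposes that $d_B < d_C$ so that $B$ is not redundant; this is necessary to have
$J_1(\mathbf{w}) = \operatorname{supp}(\mathbf{w}) = B$ for $\mathbf{w} = (0, 0, w, w)$.

• Then consider $\mathbf{w} = (0, w, \epsilon, \epsilon)$.
$J_1(\mathbf{w}) = \operatorname{supp}(\mathbf{w})$ requires that $\mathcal{G}_1(\mathbf{w}) = \{C\}$.
But then $v^C = \mathbf{w}$ so that $\alpha = d_C \, \mathbf{w} / \|\mathbf{w}\|$. In particular
$\|\alpha_A\|^2 = d_C^2 (w^2 + \epsilon^2)/(w^2 + 2\epsilon^2)$ and
$\|\alpha_B\|^2 = d_C^2 \, 2\epsilon^2/(w^2 + 2\epsilon^2)$. For the inequality
$\|\alpha_A\| \leq d_A$ to hold for all $\epsilon > 0$, we need $d_A \geq d_C$.

• Finally consider $\mathbf{w} = (\epsilon, \epsilon, w, 0)$. Following the same line as for the
previous case, $J_1(\mathbf{w}) = \operatorname{supp}(\mathbf{w})$ requires that
$\mathcal{G}_1(\mathbf{w}) = \{A\}$, which implies that $v^A = \mathbf{w}$ so that
$\alpha = d_A \, \mathbf{w} / \|\mathbf{w}\|$. In particular
$\|\alpha_B\|^2 = d_A^2 w^2/(w^2 + 2\epsilon^2)$ and
$\|\alpha_C\|^2 = d_A^2 (w^2 + \epsilon^2)/(w^2 + 2\epsilon^2)$. For the inequalities
$\|\alpha_B\| \leq d_B$ and $\|\alpha_C\| \leq d_C$ to hold for all $\epsilon > 0$, we need to
have $d_A \leq d_B$.

These three inequalities ($d_B < d_C$, $d_C \leq d_A$ and $d_A \leq d_B$) are clearly incompatible,
and the strong support is included in the weak support, which proves the result. ■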



We now characterize redundancy more technically. The intuition behind the next lemma
is the following geometric interpretation of the dual norm: the definition of $\Omega^*$ implies
that its unit ball is the intersection of cylinders of the form
$\{\alpha \mid \|\alpha_g\| \leq d_g\}$. This means that a group $g$ is redundant if its associated
cylinder contains the unit ball of the norm induced by the remaining groups. This can be
formally stated as follows:

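Lemma 28 A group $g \in \mathcal{G}$ is not redundant if and only if there exists
$\alpha \in \mathbb{R}^p$ such that $\|\alpha_g\| > d_g$ and
$\forall h \in \mathcal{G} \setminus \{g\}$, $\|\alpha_h\| \leq d_h$.

Proof Define the unit balls
$\mathcal{U} = \{\alpha \in \mathbb{R}^p \mid \forall h \in \mathcal{G}, \|\alpha_h\| \leq d_h\}$ and
$\mathcal{U}_g = \{\alpha \in \mathbb{R}^p \mid \forall h \in \mathcal{G} \setminus \{g\}, \|\alpha_h\| \leq d_h\}$.
We have that $g$ is redundant for $\Omega$ if and only if it is redundant for $\Omega^*$, and the
latter is true if and only if $\mathcal{U} = \mathcal{U}_g$. Since
$\mathcal{U} \subset \mathcal{U}_g$, $g$ is not redundant if and only if there exists
$\alpha \in \mathcal{U}_g \setminus \mathcal{U}$. ■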



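Corollary 29 Let $g \in \mathcal{G}$ and $\mathcal{H} \subset \mathcal{G}$ such that $g$ is covered
by groups in $\mathcal{H}$, i.e., $g \subset \bigcup_{h \in \mathcal{H}} h$. Then $g$ is redundant if
$d_g^2 > \sum_{h \in \mathcal{H}} d_h^2$.

Proof The fact that $g$ is covered by groups in $\mathcal{H}$ implies that, for any
$\alpha \in \mathbb{R}^p$, $\|\alpha_g\|^2 \leq \sum_{h \in \mathcal{H}} \|\alpha_h\|^2$. If $g$ is
part of the group-support, then necessarily
$d_g^2 = \|\alpha_g\|^2 \leq \sum_{h \in \mathcal{H}} \|\alpha_h\|^2 \leq \sum_{h \in \mathcal{H}} d_h^2$. ■

In particular, if all singletons are part of $\mathcal{G}$ with $d_{\{i\}} = 1$ for all $i$, this
imposes $d_g \leq \sqrt{|g|}$.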
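The sufficient condition of Corollary 29 is a one-line test. The sketch below is only an illustration; the function names `redundant_by_cover` and `max_useful_weight` are hypothetical, and the second helper covers only the special case of a cover by singletons of equal weight.

```python
import math

def redundant_by_cover(d_g, cover_weights):
    """Sufficient condition of Corollary 29: d_g^2 > sum of d_h^2 over the cover."""
    return d_g ** 2 > sum(d ** 2 for d in cover_weights)

def max_useful_weight(size, singleton_weight=1.0):
    """Largest weight of a group of `size` elements not ruled out by a singleton cover."""
    return singleton_weight * math.sqrt(size)

# A group of three elements covered by three singletons of weight 1.
print(redundant_by_cover(2.0, [1.0, 1.0, 1.0]))   # compares 4 with 3
print(redundant_by_cover(1.5, [1.0, 1.0, 1.0]))   # compares 2.25 with 3
print(max_useful_weight(3))
```

Since the corollary gives a sufficient condition only, a `False` answer does not by itself establish that the group is useful.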

In the case where the weights depend only on the cardinality of the group, i.e., $d_g = d_k$
for $|g| = k$, we consider the following condition:

$$\forall k > 1, \quad d_{k-1} < d_k < \sqrt{\frac{k}{k-1}} \; d_{k-1} . \qquad \text{(C)}$$

Lemma 30 Condition (C) is sufficient to guarantee that no group is redundant.

Proof Assume that $(d_i)_{1 \leq i \leq m}$ satisfy condition (C), and let $g \in \mathcal{G}$ be a
group of cardinality $k$. Consider the vector $\alpha = \frac{d_k}{\sqrt{k}} 1_g$ with
$1_g \in \mathbb{R}^p$ the vector with entry $i$ equal to 1 for $i \in g$ and 0 else. Since
$|g| = k$ we have $\|\alpha_g\| = d_k$. Note that (C) implies
$\frac{d_k}{\sqrt{k}} < \frac{d_{k-1}}{\sqrt{k-1}}$, which more generally implies by induction
$\frac{d_k}{\sqrt{k}} < \frac{d_j}{\sqrt{j}}$ for any $j < k$. Now, for any group
$g' \in \mathcal{G}$ of cardinality $j < k$, we have
$\|\alpha_{g'}\| \leq \sqrt{j} \, \frac{d_k}{\sqrt{k}} < d_j$. Similarly, if $|g'| = j > k$ then
$\|\alpha_{g'}\| \leq \|\alpha_g\| = d_k < d_j$, and if $g' \neq g$ but $|g'| = |g|$, then
$\|\alpha_{g'}\| < \|\alpha_g\| = d_k = d_{g'}$.

Since $\|\alpha_g\| = d_g$ and $\|\alpha_{g'}\| < d_{g'}$ for $g' \neq g$, it is possible to choose
$\epsilon > 0$ sufficiently small such that the vector $\alpha' = \alpha + \epsilon 1_g$ satisfies
$\|\alpha'_g\| > d_g$ and $\|\alpha'_{g'}\| < d_{g'}$ for any $g' \neq g$. Lemma 28 then shows
that $g$ is not redundant. ■



We would like to insist that condition (C) is sufficient to guarantee non-redundancy but
might be unnecessary for many restricted families of groups, for example as soon as each
group contains an element which belongs to no other group. However, without any condition
on the set of groups, the previous condition is the weakest possible if the weights depend
only on the group sizes, since it becomes necessary in the following special case:

Lemma 31 Assume that group $g$ with cardinality $|g| = k$ contains all $k$ groups of size
$k - 1$, then (C) is necessary for $g$ to be non-redundant.

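Proof If $g \in \mathcal{G}$ is not redundant, by Lemma 28 we can find $\alpha \in \mathbb{R}^p$
such that $\|\alpha_g\| > d_g$ and $\|\alpha_h\| \leq d_h$ for
$h \in \mathcal{G} \setminus \{g\}$. In particular, for all $i \in g$,
$\|\alpha_{g \setminus \{i\}}\|^2 \leq d_{k-1}^2$ so that

$$(k-1)\, d_k^2 < (k-1)\, \|\alpha_g\|^2 = \sum_{i \in g} \|\alpha_{g \setminus \{i\}}\|^2 \leq k \, d_{k-1}^2 ,$$

which shows the result. ■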



Condition (C) allows scalings of the weights which go from quasi-uniform weights, in
which case the larger groups dominate the smaller groups in the sense that they are preferably
selected, to weights that scale like $\sqrt{k}$, in which case the smaller groups dominate (and
in particular if the singletons are included the norm approaches the $\ell_1$-norm). Condition
(C) suggests considering weights of the form $d_k = k^\gamma$, $\gamma \in (0, \frac{1}{2})$. We
illustrate on Figure 4 the trade-offs obtained with the groups
$\mathcal{G} = \{\{1\}, \{2\}, \{3\}, \{1, 2, 3\}\}$ and different $\gamma$. The first ball for
$\gamma = 0$ is the ball we would have without considering the singletons since only the largest
group is active. At the other extreme for $\gamma = \frac{1}{2}$ the ball is the one we would have
without the $\{1, 2, 3\}$ group since only the singletons are active. In intermediate regimes,
all the groups are active in some region. More specifically, the second ball for
$\gamma = \frac{1}{4}$ corresponds to a limit case that we present in Section 8.3, while the third
one for $\gamma = \frac{\log(2)}{2 \log(3)}$ illustrates another problem that we now introduce: the
possibility that a group dominates other groups. Intuitively for
$\gamma < \frac{\log(2)}{2 \log(3)}$, i.e., if the sphere gets any smaller than on the
third ball, it becomes impossible to select a support of exactly two covariates even though
(i) such a support would be a union of groups and (ii) no group is redundant. We detail
this notion in the next section.
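As a quick sanity check of the range of exponents suggested by condition (C), the following sketch tests the two strict inequalities $d_{k-1} < d_k < \sqrt{k/(k-1)}\, d_{k-1}$ for power weights $d_k = k^\gamma$. The helper names are hypothetical, and the check only covers a finite range of group sizes in floating-point arithmetic.

```python
import math

def satisfies_condition_c(d, k_max):
    """Check d(k-1) < d(k) < sqrt(k / (k-1)) * d(k-1) for k = 2, ..., k_max."""
    return all(
        d(k - 1) < d(k) < math.sqrt(k / (k - 1)) * d(k - 1)
        for k in range(2, k_max + 1)
    )

def power_weights(gamma):
    """Weights of the form d_k = k ** gamma."""
    return lambda k: k ** gamma

# gamma = 0 gives uniform weights, which violate the strict lower inequality;
# exponents strictly between 0 and 1/2 are expected to pass.
for gamma in (0.0, 0.25, 0.49):
    print(gamma, satisfies_condition_c(power_weights(gamma), 50))
```

The boundary exponent $\gamma = \frac{1}{2}$ is left out on purpose: there the upper inequality holds with equality, so the outcome of a floating-point comparison would depend on rounding.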

8.2 Dominating group 

Let us first formalize the notion of group domination.

Definition 32 Let $g \in \mathcal{G}$ and $\mathcal{H} \subset \mathcal{G}$ a set of subgroups
satisfying $\forall h \in \mathcal{H}$, $h \subset g$. We say that $g$ dominates $\mathcal{H}$ if
$\mathcal{H}$ could be the weak group-support for some $w$ if $g$ was removed from
$\mathcal{G}$, but is the weak group-support of no $w$ in the presence of $g$.

We can characterize the presence of domination in terms of weights as follows: 

Lemma 33 A group $g$ dominates a set of subgroups $\mathcal{H}$ if and only if, on the one
hand, $\mathcal{H}$ is a possible group-support when $g$ is removed from $\mathcal{G}$, and, on
the other,

$$d_g < P(g, \mathcal{H}) = \min \{ \|\alpha_g\| \mid \alpha \in \mathbb{R}^p \text{ and } \|\alpha_h\| = d_h, \ \forall h \in \mathcal{H} \} .$$

Proof First note that the set of constraints $\|\alpha_h\| = d_h$, $\forall h \in \mathcal{H}$ is
feasible since $\mathcal{H}$ is assumed to be a possible group support without $g$. Then note that
the condition is equivalent to saying that the ball
$\{\alpha_g \in \mathbb{R}^{|g|} \mid \|\alpha_g\| \leq d_g\}$ does not intersect the previous
feasible set, which characterizes the set of possible dual variables for which the weak
group-support is $\mathcal{H}$. ■



As discussed previously, one natural property to require would be that if $w$ is exactly
supported by a group $g$, its group-support should be $g$. As argued in Lemma 27, we cannot
have this property in general. We can however show that if the support of $w$ is a single
group in $\mathcal{G}$ then this group is always in the group support of $w$.

The following result shows that, under some conditions on the weights, we can ensure 
that a group g does not dominate any set of subgroups that do not cover it entirely. 

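Lemma 34 Let a group $g \in \mathcal{G}$ and a set of subgroups $\mathcal{H} \subset \mathcal{G}$
such that $\forall h \in \mathcal{H}$, $h \subset g$ and
$\bigcup_{h \in \mathcal{H}} h \subsetneq g$. Assuming that $\mathcal{H}$ could be in the group
support of some $w$ if $g$ was removed from $\mathcal{G}$, then $g$ does not dominate
$\mathcal{H}$ if, for some constant $d_1 > 0$, weights satisfy $d_h \leq \sqrt{|h|} \, d_1$
for all $h \in \mathcal{H}$ and $d_g \geq \sqrt{|g| - 1} \, d_1$.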

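Proof By Lemma 33, $g$ does not dominate $\mathcal{H}$ if and only if
$d_g \geq P(g, \mathcal{H})$. To prove this, let us rewrite $P(g, \mathcal{H})^2$ as the solution
of the following optimization problem, with $x_i = \alpha_i^2$:

$$\min_{x \in \mathbb{R}_+^p} \; x^\top 1_g \quad \text{s.t.} \quad \forall h \in \mathcal{H}, \; x^\top 1_h = d_h^2 .$$

By strong duality of linear programs $P(g, \mathcal{H})^2$ is also the solution of the dual
problem:

$$\max_{u} \; \sum_{h \in \mathcal{H}} u_h d_h^2 \quad \text{s.t.} \quad \forall i, \; \sum_{h \in \mathcal{H}} u_h 1_{\{i \in h\}} \leq 1_{\{i \in g\}} .$$

But if $\bar{h} = \bigcup_{h \in \mathcal{H}} h$, under the conditions on the weights in Lemma 34,
we can upper bound the optimal value as follows:

$$\sum_{h \in \mathcal{H}} u_h d_h^2 \leq d_1^2 \sum_{h \in \mathcal{H}} u_h |h| = d_1^2 \sum_{i \in g} \sum_{h \in \mathcal{H}} u_h 1_{\{i \in h\}} \leq d_1^2 \, |g \cap \bar{h}| \leq d_1^2 \, (|g| - 1) ,$$

where the second inequality results from the constraints of the dual program and the fact
that for $i \in g \setminus \bar{h}$, the corresponding terms in the sum are equal to 0. This shows
that if $d_g^2 \geq (|g| - 1) d_1^2$ then $d_g \geq P(g, \mathcal{H})$. ■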

Note that Lemma 33 is tight in the following case:

Lemma 35 For any group $g \in \mathcal{G}$, if $\mathcal{H}$ is a set of $|g| - 1$ singletons of
$g$, each with weight $d_1$, that could be in a group support if $g$ was removed, then $g$
dominates $\mathcal{H}$ if and only if $d_g < d_1 \sqrt{|g| - 1}$.

Proof This is a direct consequence of Lemma 33, where the value of $P(g, \mathcal{H})$ is
trivially equal to $d_1 \sqrt{|g| - 1}$. ■



What the two previous lemmata indicate is that, if there are large gaps in size between
a group of size $k$ and many much smaller subgroups contained in it, it is necessary to choose
a value for the weight which is possibly unreasonably large, to allow all combinations of
subgroups to be selected (even non-covering ones). Lemma 35 is illustrated on Figure 4,
with the set of groups $\mathcal{G} = \{\{1\}, \{2\}, \{3\}, \{1, 2, 3\}\}$. Giving singletons the
weight $d_1 = 1$, the critical weight for $g = \{1, 2, 3\}$ to dominate or not pairs of singletons
is $d_g = \sqrt{|g| - 1} = \sqrt{2}$. We represent it equivalently as $d_g = |g|^\gamma$ with
$\gamma = \frac{\log(2)}{2 \log(3)}$ on Figure 4. This corresponds to the critical value, below
which it is not possible to select two singletons only. The trade-off
we are facing here is not surprising when the weights are thought to correspond to code
lengths. Indeed, in light of the interpretation of the norm $\Omega$ as a relaxation of a block
coding penalization, it is clear that allowing groups with quite large weights (i.e., code
lengths) increases the expressiveness of the code at the expense of compressibility and reduces
the strength of the prior on the support, since a large weight allows for a greater diversity of
supports. Put more simply, there is a trade-off between how coarsely the supports are encoded
and how informative the prior on the supports is. The trade-off can also be interpreted as
a bias-variance trade-off, where biasing the estimate of the support with a coarser set of
patterns reduces the variance in its estimation.
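The critical weight of Lemma 35 and the corresponding exponent for the example with three singletons can be computed directly. This is a small sketch under the assumptions of that lemma (a set of $|g| - 1$ singletons with common weight $d_1$); the function names are hypothetical.

```python
import math

def domination_threshold(group_size, singleton_weight=1.0):
    """P(g, H) when H consists of |g| - 1 singletons of g with equal weight."""
    return singleton_weight * math.sqrt(group_size - 1)

def dominates(d_g, group_size, singleton_weight=1.0):
    """g dominates the |g| - 1 singletons iff d_g is strictly below the threshold."""
    return d_g < domination_threshold(group_size, singleton_weight)

# Critical exponent for groups {1}, {2}, {3}, {1, 2, 3} with d_g = |g| ** gamma.
gamma_crit = math.log(2) / (2 * math.log(3))
print(gamma_crit)
print(dominates(3 ** 0.25, 3))   # gamma = 1/4 lies below the critical exponent
print(dominates(3 ** 0.40, 3))   # gamma = 0.40 lies above it
```

By construction $3^{\gamma_{\mathrm{crit}}} = \sqrt{2}$, which is the critical weight $\sqrt{|g| - 1}$ for $|g| = 3$.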

It should be noted that, as an important consequence of domination, the set of possible
sparsity patterns (although consisting of unions of sets of $\mathcal{G}$) is in general not
stable under union.

8.3 Importance of weights for support consistency, FDR and FWER control 

In this section we consider the following regression setting:

$$\min_{w} \; \frac{1}{2} \| w - w^* + \varepsilon \|^2 + \lambda \, \Omega(w) , \qquad (26)$$

where the design matrix is taken to be the identity and the noise to be Gaussian, bearing in
mind that the analysis we propose here could be extended easily to the case of a design
satisfying properties such as RIP with noise that could be taken more generally subgaussian. The
mapping to the solution of this optimization problem is often called the soft-thresholding
operator, shrinkage operator or proximal operator associated with the norm $\Omega$. We denote
this mapping $w \mapsto \mathrm{ST}(w)$. In terms of support recovery and group-support
consistency, a reasonable minimal requirement is that for sufficiently large values of the
coefficients and for small levels of noise, assuming that the distribution of the noise is
absolutely continuous with respect to the Lebesgue measure, the solution to problem (26) should
retrieve the correct support, provided the latter can be expressed as a union of groups.
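For intuition about the operator $\mathrm{ST}$, the following sketch implements it in the special case of disjoint groups only, where the penalty reduces to the usual group Lasso penalty $\sum_g d_g \|w_g\|$ and the solution of (26) is the classical block soft-thresholding applied to the noisy observation $y = w^* - \varepsilon$. It is not a solver for the overlapping case, which in general requires an iterative algorithm; the function name and toy values are hypothetical.

```python
import math

def group_soft_threshold(y, groups, weights, lam):
    """Proximal operator of lam * sum_g d_g * ||w_g|| for DISJOINT groups only."""
    w = [0.0] * len(y)
    for g, d in zip(groups, weights):
        norm = math.sqrt(sum(y[i] ** 2 for i in g))
        # The whole block is set to zero when ||y_g|| <= lam * d_g.
        scale = max(0.0, 1.0 - lam * d / norm) if norm > 0 else 0.0
        for i in g:
            w[i] = scale * y[i]
    return w

# Toy observation: a strong first block and a weak second block.
y = [3.0, 4.0, 0.1, -0.1]
groups = [(0, 1), (2, 3)]
weights = [1.0, 1.0]
w_hat = group_soft_threshold(y, groups, weights, lam=1.0)
print(w_hat)
```

In this special case a block is discarded exactly when $\|y_g\| \leq \lambda d_g$, which is the thresholding rule used in the discussion of false positives and false negatives.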
We first show that redundant groups may never be selected by (26).

Lemma 36 Take $\mathcal{G} = \{g, g'\}$ with $g \subset g'$ and $d_g \geq d_{g'}$. Then for any
$w$, $g \notin \mathcal{G}_1(\hat{w})$ a.s. where $\hat{w} = \mathrm{ST}(w)$.

Proof We first note that the optimality condition for (26) is

$$\hat{w} - w^* + \varepsilon = -\lambda \alpha , \qquad (27)$$

where $\alpha \in \mathcal{A}(\hat{w})$. We then reason by contradiction and assume
$g \in \mathcal{G}_1(\hat{w})$ so that $\|\alpha_g(\hat{w})\| = d_g$. Then, because
$g \subset g'$,
$\|\alpha_{g'}(\hat{w})\|^2 = \|\alpha_g(\hat{w})\|^2 + \|\alpha_{g' \setminus g}(\hat{w})\|^2 \leq d_{g'}^2$,
which implies
$\alpha_{g' \setminus g} = 0 = \hat{w}_{g' \setminus g} + \varepsilon_{g' \setminus g} - w^*_{g' \setminus g}$.
But $w^*_{g' \setminus g} - \varepsilon_{g' \setminus g} \neq 0$ a.s., so this implies
$\hat{w}_{g' \setminus g} \neq 0$, and therefore that $v^{g'} \neq 0$. But $v^{g'}$ restricted to
$g' \setminus g$ should then both be equal to 0 by the optimality condition, and be equal to
$\hat{w}_{g' \setminus g}$, which is a contradiction. ■

Lemma 36 should be compared to Lemma 26. While the latter shows that $g$ cannot be
selected without $g'$, Lemma 36 shows that in the regression setting it may simply not be
selected a.s. This shows in particular that $d_g \geq d_{g'}$ can pose a problem of support
consistency because it implies that, if the only way to write the support as a union of elements
of $\mathcal{G}$ is $\operatorname{supp}(w) = g$, the support is a.s. never correctly estimated by
solving problem (26).

We now discuss in more detail the influence of the weights on the probability to select
false positives (Section 8.3.1) and to have false negatives (Section 8.3.2).

8.3.1 False positives 

Let us consider a group $g \in \mathcal{G}$ of size $|g| = k$ which is outside of the support
(i.e., $w^*_g = 0$), and such that no other group intersecting it is selected. From the
optimality condition (27) we see that $\hat{w}_g = 0$ if and only if
$\|\varepsilon_g\|^2 \leq \lambda^2 d_g^2$. If we assume that $\lambda = \sigma$, then setting

$$d_k = \sqrt{k + c \sqrt{k}} \qquad (28)$$

is an interesting choice because this is, at second order, the smallest possible rate that
ensures that each group has a vanishingly small probability of being selected by chance.
Indeed, on the one hand, $\|\varepsilon_g\|^2 \sim \sigma^2 \chi^2_k$ so the usual Chernoff bound
yields:

$$P(\|\varepsilon_g\|^2 \geq t k \sigma^2) \leq e^{-\frac{k}{2} (t - \log(t) - 1)} ,$$

and it is easy to verify that for $t = 1 + \frac{c}{\sqrt{k}}$ with $c$ sufficiently large, the
above probability can be made arbitrarily small uniformly in $k$. This implies that if $d_k$ is
fixed according to (28), then with $c$ large enough we can make the probability that $g$ is
selected as small as desired. On the other hand, choosing $d_k$ smaller, i.e.,
$d_k^2 - k = o(\sqrt{k})$, would fail to guarantee
$P(\|\varepsilon_g\|^2 \leq \sigma^2 d_k^2) \geq 1 - \eta$ for $k$ large because the central limit
theorem implies that
$\frac{\|\varepsilon_g\|^2 / \sigma^2 - k}{\sqrt{2k}} \xrightarrow{d} \mathcal{N}(0, 1)$.
In summary, (28) is the smallest rate which ensures that we can
control the probability of selecting a wrong group uniformly in $k$. Finally, note that for
$d_k = \sqrt{k + c \sqrt{k}}$, condition (C) is satisfied; furthermore, we have
$\sqrt{c} \, k^{1/4} \leq d_k \leq \sqrt{1 + c} \, k^{1/2}$. In particular if we consider the case
of $c \to \infty$, we retrieve a scaling of the form $d_k \propto k^{1/4}$.
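The behavior of the weights (28) can be explored numerically. The sketch below evaluates the Chernoff bound $e^{-\frac{k}{2}(t - \log t - 1)}$ at $t = 1 + c/\sqrt{k}$ over a range of group sizes and checks condition (C) for the same weights; the function names and the finite range of $k$ are arbitrary choices made for illustration.

```python
import math

def weight(k, c):
    """Weight d_k = sqrt(k + c * sqrt(k)) from (28)."""
    return math.sqrt(k + c * math.sqrt(k))

def chernoff_bound(k, c):
    """Bound on P(chi2_k >= t * k) for t = 1 + c / sqrt(k)."""
    t = 1.0 + c / math.sqrt(k)
    return math.exp(-0.5 * k * (t - math.log(t) - 1.0))

def satisfies_condition_c(c, k_max):
    """Check d_{k-1} < d_k < sqrt(k / (k-1)) * d_{k-1} for k = 2, ..., k_max."""
    return all(
        weight(k - 1, c) < weight(k, c) < math.sqrt(k / (k - 1)) * weight(k - 1, c)
        for k in range(2, k_max + 1)
    )

# Largest value of the bound over group sizes 1..1000, for several values of c.
worst = {c: max(chernoff_bound(k, c) for k in range(1, 1001)) for c in (1.0, 4.0, 10.0)}
for c, value in worst.items():
    print(c, value, satisfies_condition_c(c, 200))
```

Taking the maximum over $k$ is what "uniformly in $k$" refers to: increasing $c$ lowers the largest value of the bound across all group sizes at once.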

Note that if we want to control the expected number of incorrectly selected variables
instead of the number of incorrectly selected groups, then, using the same reasoning, but
based on a bound on the expected number of false positives of the form
$\sum_{g \in \mathcal{G}} |g| \, P(\|\varepsilon_g\|^2 > t |g| \sigma^2)$, we would show similarly
that an appropriate choice for $d_k$ is $d_k = \sqrt{k + c \sqrt{k \log k}}$.
Obtaining a control of the type FWER instead of FDR is possible by choosing
$c \propto \sqrt{\log m}$.
The reader has probably noticed that the analysis in this section ignores the overlaps between
groups, and for groups that have a quite significant overlap with a group of the support, the
probability of being incorrectly selected is much larger. This issue can however be addressed
by choosing $c$ sufficiently large. Besides this point, the weights derived nonetheless satisfy
constraints from the previous sections in which issues arising from overlaps were considered.

8.3.2 False negatives 

These choices for $d_k$ allow us to control for false positives, but it is interesting as well
to ask which groups containing true non-zero elements will be selected, and which ones could be
false negatives. For simplicity we assume that the entries of $w^*$ take values in $\{0, 1\}$ and
that the noise is Gaussian as previously. If the fraction of non-zero elements in $w^*$ is $p$
and one assumes a null model $H_0$ under which group $g$ is unrelated to the nonzero pattern of
$w^*$ then it is reasonable to model the number of non-zero elements in $g$ as a binomial random
variable $\mathrm{Bin}(k, p)$ with $k = |g|$. Using again the KKT conditions, if none of the
groups intersecting $g$ is selected, we will have $\hat{w}_g = 0$ if and only if
$\|w^*_g - \varepsilon_g\|^2 \leq \lambda^2 d_k^2$.

Since $\|w^*_g\|^2 \sim \mathrm{Bin}(k, p)$ and $\|\varepsilon_g\|^2 \sim \sigma^2 \chi^2_k$ we
have $E[\|w^*_g - \varepsilon_g\|^2] = kp + k\sigma^2$ and

$$\operatorname{Var}(\|w^*_g - \varepsilon_g\|^2) = \operatorname{Var}(\|w^*_g\|^2) + 4 E[(\varepsilon_g^\top w^*_g)^2] + \operatorname{Var}(\|\varepsilon_g\|^2) = kp(1 - p) + 4kp\sigma^2 + 2k\sigma^4 .$$

If $\lambda^2 = p + \sigma^2$ and if $d_k$ is chosen of the previous form
$d_k = \sqrt{k + c \sqrt{k}}$, then, for an appropriate choice of $c$, namely
$c = c' \, \frac{\sqrt{p(1 - p) + 4p\sigma^2 + 2\sigma^4}}{p + \sigma^2}$, classical Chernoff
bounds together with an analysis similar to that of the previous section show that we have
$\|w^*_g - \varepsilon_g\|^2 \geq \lambda^2 d_k^2$ with probability decreasing exponentially in
$c'$. Therefore in this model, groups selected can be interpreted as groups that are "enriched"
in non-zero coefficients, where we call a group enriched if the number of non-zero coefficients
in that group is significantly larger than for a random group of the same size. To put things
differently, the false negatives correspond to groups that do not have a significant number of
non-zero elements.
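The mean and variance used in this argument can be checked by simulation. The sketch below draws the entries of $w^*_g$ as independent Bernoulli variables with parameter $p$ (so that $\|w^*_g\|^2 \sim \mathrm{Bin}(k, p)$) and adds independent Gaussian noise; since the noise is symmetric, its sign does not affect these moments. The parameter values, seed and function names are arbitrary choices for illustration.

```python
import random

def simulate_squared_norm(k, p, sigma, n_draws, seed=0):
    """Empirical mean and variance of ||w_g - eps_g||^2 under the null model."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        total = 0.0
        for _ in range(k):
            w = 1.0 if rng.random() < p else 0.0       # Bernoulli(p) entry of w*
            total += (w - rng.gauss(0.0, sigma)) ** 2  # Gaussian noise, std sigma
        draws.append(total)
    mean = sum(draws) / n_draws
    var = sum((x - mean) ** 2 for x in draws) / (n_draws - 1)
    return mean, var

def theoretical_moments(k, p, sigma):
    """Mean k p + k sigma^2 and variance k p (1 - p) + 4 k p sigma^2 + 2 k sigma^4."""
    mean = k * p + k * sigma ** 2
    var = k * p * (1 - p) + 4 * k * p * sigma ** 2 + 2 * k * sigma ** 4
    return mean, var

empirical = simulate_squared_norm(k=20, p=0.3, sigma=0.5, n_draws=20000)
expected = theoretical_moments(k=20, p=0.3, sigma=0.5)
print(empirical, expected)
```

With these values the formulas give a mean of 11 and a variance of 12.7, and the empirical estimates should agree up to Monte Carlo error.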

This property is certainly a feature that can be desirable, especially in the applications 
in genomics that we have in mind where it is common to test for biological processes (or 
other groups of genes) that are enriched in "active genes" . 

Note that if a group $g$ has elements in common with another selected group $g'$, the
elements that are in $g'$ are explained in part by $g'$ and are therefore "discounted" for group
$g$, in the sense that we only need

$$\Big\| w^*_g - \sum_{g' \cap g \neq \emptyset} v^{g'}_g - \varepsilon_g \Big\|^2 \leq \lambda^2 d_g^2 .$$

A group is therefore selected if it contains enough non-zero components that it itself explains.

It should be stressed that the previous analysis depends on the assumption that the
components of $w^*$ are of the same order of magnitude and fails if the distribution of the
entries of $w^*$ has a long tail.

Figure 5: Graph-Lasso: if the penalty leads to the selection of connected sets of covariates
like the edges, the resulting pattern should be more connected on the graph.



Finally, the analysis presented in these last two sections is heuristic in nature. It is by no
means aimed at proving that a specific weighting scheme can be chosen universally for all
possible collections of groups $\mathcal{G}$, but rather solely motivated by the need for an
initial set of criteria to guide this choice. It is likely that finer analyses, namely under
high-dimensional scaling and dedicated to specific collections of groups, are required to make
more definite recommendations for the choice of the weights. It should be noted that a different
view on the weights can be adopted by considering them as defined through a set function; this is
the point of view adopted in Obozinski and F. (2011) which relates the behavior of $\Omega$ to
the set-function.

9. Graph Lasso 

We now consider the situation where we have a simple undirected graph $(I, E)$, where the
set of vertices $I = [1, p]$ is the set of covariates and $E \subset I \times I$ is a set of
edges that connect covariates. We suppose that we wish to estimate a sparse model such that
selected covariates tend to be connected to each other, i.e., form a limited number of connected
components on the graph. An obvious approach is to use the norm $\Omega_\cup^{\mathcal{G}}$ where
$\mathcal{G}$ is a set that generates connected components by union. For example, we may consider
for $\mathcal{G}$ the set of edges, cliques, or small linear subgraphs. As an example,
considering all edges, i.e., $\mathcal{G} = E$ leads to:



Alternatively, we will consider in the experiments the set of all linear subgraphs of length
$k \geq 1$. Although we have no formal statement on how to choose $k$, it intuitively controls
the size of the groups of connected variables which are selected, and should therefore be
typically chosen to be slightly smaller than the size of the minimal connected component
expected in the support of the model.

10. Experiments

To assess the performance of our method when either overlapping groups or a graph are
provided as a priori information, and subsequently, to assess the influence of the weights
$d_g$, we considered several synthetic examples of regression models in which the structure of
the model generating the data matches the prior on supports induced by the norm.

10.1 Synthetic data: given overlapping groups 

In this experiment, we simulated data with $p = 82$ variables, covered by 10 groups of 10
variables with 2 variables of overlap between two successive groups:

$\mathcal{G} = \{\{1, \dots, 10\}, \{9, \dots, 18\}, \dots, \{73, \dots, 82\}\}$.

We chose the support of $w$ to be the union of groups 4 and 5 and sampled both the
coefficients on the support and the offset from i.i.d. Gaussian variables. Note that in this
setting, the support can be expressed as a union of groups, but not as the complement of a
union. Therefore, our latent group Lasso penalty $\Omega_\cup^{\mathcal{G}}$ could recover the
right support.
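The group structure of this experiment can be reproduced in a few lines. This sketch only builds the index sets (1-indexed, as in the text) and the support; it does not generate data, and the function name is hypothetical.

```python
def overlapping_groups(n_groups=10, size=10, overlap=2):
    """Consecutive groups of `size` variables sharing `overlap` variables (1-indexed)."""
    step = size - overlap
    return [list(range(1 + i * step, 1 + i * step + size)) for i in range(n_groups)]

groups = overlapping_groups()
p = max(groups[-1])                                  # total number of variables
support = sorted(set(groups[3]) | set(groups[4]))    # union of groups 4 and 5
print(p, support[0], support[-1], len(support))
```

The support obtained this way is the contiguous block of 18 variables covered by groups 4 and 5, which share two variables.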

The model is learned from $n$ data points $(x_i, y_i)$, with $y_i = w^\top x_i + \varepsilon$,
$\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma = |E(Xw + b)|$. Using an $\ell_2$ loss
$L(w) = \|y - Xw - b\|^2$, we learn models from 100 such training sets.

We report the empirical frequencies of the selection of each variable on Figure 6. For
any choice of $\lambda$, the Lasso frequently misses some variables from the support, while
$\Omega_\cup^{\mathcal{G}}$ does not miss any variable from the support on a large part of the
regularization path. Besides, we observe that over the replicates, the Lasso never selects the
exact correct pattern for $n < 100$. For $n = 100$, the right pattern is selected with low
frequency on a small part of the regularization path. $\Omega_\cup^{\mathcal{G}}$ on the other
hand selects it up to 92% of the time for $n = 50$ and more than 99% on more than one third of
the path for $n = 100$.

Figure 7 shows the root mean squared error for both methods and several values of $n$.
For both methods, the full regularization path is computed and tested on three replicates
of $n$ training and 100 testing points. We selected the best parameter on average and used
it to train and test a model on a fourth replicate. For a large range of $n$,
$\Omega_\cup^{\mathcal{G}}$ not only helps to recover the right pattern, but also decreases the
MSE compared to the classical Lasso.

10.2 Synthetic data: given linear graph structure 

We now consider the case where the prior given on the variables is a graph structure and
where we are interested in solutions which are highly connected components on this graph.
As a first simple illustration, we consider a chain in which variables with successive indices
are connected. We use $w \in \mathbb{R}^p$, $p = 100$, $\operatorname{supp}(w) = [20, 40]$. The
nodes of the graph correspond to the parameters $w_i$ and the edges to the pairs
$(w_i, w_{i+1})$, $i = 1, \dots, p - 1$. The parameters of the model and the 50 training examples
$(x_i, y_i)$ are drawn using the same protocol as in the previous experiment. We use for the
groups all the sub-chains of length $k$. Results are reported for various choices of $k$ and
compared to the Lasso ($k = 1$).
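The groups used on the chain can be enumerated as follows. This is a minimal sketch with 0-indexed variables and a hypothetical function name; the values of $k$ are those reported for this experiment.

```python
def chain_groups(p, k):
    """All sub-chains of k consecutive variables on a chain of p nodes (0-indexed)."""
    return [tuple(range(start, start + k)) for start in range(p - k + 1)]

p = 100
for k in (1, 2, 4, 8):
    groups = chain_groups(p, k)
    print(k, len(groups), groups[0], groups[-1])
```

For $k = 1$ the groups are the singletons, so the penalty reduces to that of the Lasso; for $k = 2$ the groups are exactly the edges of the chain.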

Figure 8 shows the frequency of each variable selection over 20 replications. Here again, 
using a group prior improves pattern recovery, with better results as k increases. However, 
for larger groups, two consecutive groups are very correlated, which makes it more difficult 
to identify the exact boundaries of the support.

Figure 6: Frequency of selection of each variable with the Lasso (left) and
$\Omega_\cup^{\mathcal{G}}$ (right) for $n = 50$ (top) and 100 (bottom). For each variable index
(on the y-axis), its frequency of selection is represented in levels of gray as a function of the
regularization parameter $\lambda$ (on the x-axis), both for the Lasso penalty and
$\Omega_\cup^{\mathcal{G}}$. The transparent blue band superimposed indicates the set of
covariates that belong to the support.



10.3 Synthetic data: effect of the weights 

As discussed in Section 8, the choice of a set of weights $\{d_g\}_{g \in \mathcal{G}}$
influences the variable selection behavior of the learning algorithm penalized by $\Omega$. At
one extreme, if the weights are uniform, only groups that are included in no other can be
selected. At the other extreme, for weights growing as the square root of the group size, the
group-support selected will be composed (almost surely) of the smallest groups possible covering
the support.

Figure 7: Root mean squared error of overlapped group lasso and Lasso as a function of
the number of training points.

To illustrate the effect of the weighting scheme on covariate selection, we run three
experiments with respectively $p = 100, 200, 300$ covariates and $n = 100, 50, 30$ training
points. In each setting, the groups are all the sets of size from 1 to 20 formed by sequences
of consecutive covariates, much like in 10.2 but with more groups. Note that this creates
a lot of nested groups. The support is formed by covariates with indices from 5 to 24 and
from 90 to 92, i.e., 23 covariates. The noise level $\sigma^2$ is 0.1. For each of the three
settings, we compare 6 weighting schemes over 50 replications. The first 4 schemes follow (28)
and assign $d_s = \sqrt{s + c \sqrt{s}}$ to each group of size $s$, with $c = 0, 1, 4, 6$. We
also try $d_s = s^{1/4}$ (the limit when $c$ grows) and $d_s = 1$. Note that $d_s = 1$ and
$c = 0$ ($d_s = \sqrt{s}$) correspond to the two extreme regimes in condition (C).
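The collection of groups, the support and the six weighting schemes of this experiment can be sketched as follows for the setting $p = 100$. Indices are 1-based as in the text; the function names and the dictionary of schemes are hypothetical conveniences, not part of the original protocol.

```python
import math

def all_chains(p, max_size):
    """All sets of 1..max_size consecutive covariates among 1..p."""
    return [tuple(range(s, s + size))
            for size in range(1, max_size + 1)
            for s in range(1, p - size + 2)]

def scheme_weight(size, c):
    """Weight d_s = sqrt(s + c * sqrt(s)) following (28)."""
    return math.sqrt(size + c * math.sqrt(size))

groups = all_chains(100, 20)
support = set(range(5, 25)) | set(range(90, 93))

# The six schemes: four values of c, the limit s ** (1/4), and uniform weights.
schemes = {f"c={c}": (lambda s, c=c: scheme_weight(s, c)) for c in (0, 1, 4, 6)}
schemes["c=inf"] = lambda s: s ** 0.25
schemes["uniform"] = lambda s: 1.0

print(len(groups), len(support))
for name, d in schemes.items():
    print(name, d(1), d(20))
```

Counting the groups gives $\sum_{s=1}^{20} (p - s + 1)$ sets, most of which are nested in larger ones.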

We evaluate the performance of the regularization in two different ways. First, we select 
by cross-validation the value of A that yields the smallest MSE and return the corresponding 
value. Second, we return the best possible recovery error attainable on the entire regular- 
ization path. We consider these two criteria since it is known that the regularization regime 
corresponding to optimal support recovery and best MSE are not the same (Bach, 2008b; 
Leng et al., 2004). 

Ideally, for support recovery, we would have to either use a theoretical value for A or to 
use the OLS-hybrid two-step procedure (Efron et al., 2004) in which the models obtained 
in sequence along the regularization path are refitted with OLS and tested on a held out 
set to select the best model. This would obviously lead to a much heavier experimental 
setting, which is why we simply return the best performance along the path. 

Figure 8: Variable selection frequency with $\Omega_\cup^{\mathcal{G}}$ using the chains of
length $k$ (left) as groups, for $k = 1$ (Lasso), 2, 4, 8. For each variable index (on the
y-axis), its frequency of selection is represented in levels of gray as a function of the
regularization parameter $\lambda$ (on the x-axis), both for the Lasso penalty and
$\Omega_\cup^{\mathcal{G}}$. The transparent blue band superimposed indicates the covariates that
belong to the support.

The results are shown in Tables 1, 2 and 3. In each case, the best average MSE across
the 50 runs and along the regularization path is given along with the corresponding point
on the regularization path ($\lambda^*$), average number of selected variables in the
corresponding model (Model size*), pattern recovery error of the selected model (Rec err*) and
lowest pattern recovery error along the regularization path (Rec err min). The pattern recovery
error is the average of the proportion of covariates that were in the support and were not
selected, and the proportion of covariates that were not in the support and were selected.
The standard deviation is given for each measured quantity as well. The regularization path
was approximated by a grid of 51 values of $\lambda$ between $2^{-7}$ and $2^{3}$. For Table 2,
a longer grid of 76 values starting at $2^{-12}$ was used to make sure that the end of the
regularization path was reached.
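The pattern recovery error and the grid of regularization parameters can be sketched as follows. Two points are assumptions rather than statements of the text: the two proportions are taken relative to the size of the support and of its complement respectively, and the grid is taken to be evenly spaced in $\log_2 \lambda$ (which gives a step of 0.2 for both grids mentioned). Function names are hypothetical.

```python
def pattern_recovery_error(selected, support, p):
    """Average of the miss rate on the support and the false-selection rate off it.

    Assumes a non-empty support that is a strict subset of the p covariates.
    """
    selected, support = set(selected), set(support)
    missed = len(support - selected) / len(support)
    spurious = len(selected - support) / (p - len(support))
    return 0.5 * (missed + spurious)

def lambda_grid(log2_min, log2_max, n_values):
    """Assumed log-spaced grid between 2**log2_min and 2**log2_max."""
    step = (log2_max - log2_min) / (n_values - 1)
    return [2.0 ** (log2_min + i * step) for i in range(n_values)]

short_grid = lambda_grid(-7, 3, 51)
long_grid = lambda_grid(-12, 3, 76)
print(len(short_grid), short_grid[0], short_grid[-1])
print(pattern_recovery_error({1, 3}, {1, 2}, 10))
```

Under this definition a model selecting nothing and a model selecting everything both have an error of 0.5, and only exact recovery gives an error of 0.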

The last column of Table 1 illustrates the effect of the weighting scheme on pattern
recovery.

Figure 9: Variable selection for one of the 50 runs with $\Omega_\cup^{\mathcal{G}}$ using the
chains up to length 20 as groups and weights of the form $d_k = \sqrt{k + c \sqrt{k}}$,
$c = 0, 1, 4, 6, \infty$ and uniform weights (from left to right, top to bottom). A transparent
blue band is superimposed to indicate the covariates that belong to the support.



The results of Table 1 correspond to $n = 100$, $p = 100$ so that if $s = 23$ is the size of
the support, we have $n/(2s \log(p)) \approx 0.47$, which means that the sample size is slightly too
small for the Lasso to recover the support exactly. Note that, as expected from the theory,
the fifth column shows that the model selected based on the MSE is not optimal in terms
of variable selection. The fourth column shows that more uniform weights encourage the
selection of more variables, which is expected given that they favor the selection of larger
groups. Lastly, the values of the MSE suggest that in this regime of sparsity, dimension
and number of training points, the performance in pattern recovery has little influence
on the MSE, because there are enough training points to deal with the noise created by
the selection of spurious covariates. Here again, however, the two extreme regimes lead to
higher MSE.
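
The ratio quoted above can be checked with a few lines of arithmetic. The sketch assumes that the logarithm is the natural logarithm, which is the reading that reproduces the quoted value.

```python
import math

n, p, s = 100, 100, 23                  # sample size, dimension, support size
ratio = n / (2 * s * math.log(p))       # natural logarithm assumed
print(round(ratio, 2))                  # 0.47
```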

Figure 9 illustrates the influence of the weights on the selection behavior. As expected
from theory, uniform weights ($d_k = 1$) only allow selection of the largest groups, i.e., chains
of size 20, while at the other extreme, for $d_k = \sqrt{k}$, only the small groups (singletons)
are active. In intermediate regimes, all groups are active and allow recovery of the correct
support at some point on the regularization path, except $c = 1$, which on this particular
run does not yield perfect recovery. More adequate choices of $c$ lead to correct recovery on a
larger portion of the regularization path.

Table 1: Effect of $c$ on the MSE, the $\lambda$ giving the best average MSE, the pattern recovery
error at the optimal MSE, and the best pattern recovery error possible. 100
training points, 100 dimensions, 50 replications.

c                      MSE                  $\lambda^*$   Model size*      Rec err*            Rec err min
0                      0.06709 ± 0.1814     0.02368       37.08 ± 12.8     0.1068 ± 0.07444    0.07148 ± 0.03768
1                      0.02891 ± 0.09583    0.01031       41.8 ± 18.4      0.1245 ± 0.12       0.02951 ± 0.02057
4                      0.04513 ± 0.07202    0.0136        49.72 ± 27.21    0.1759 ± 0.1759     0.01468 ± 0.01599
6                      0.03877 ± 0.1116     0.01031       45.78 ± 26.63    0.1506 ± 0.1741     0.01804 ± 0.01579
$d_s = \sqrt[4]{s}$    0.04318 ± 0.08945    0.0359        51.72 ± 27.11    0.1878 ± 0.1757     0.02461 ± 0.02585
$d_s = 1$              0.09263 ± 0.2278     0.04737       81.22 ± 17.16    0.3764 ± 0.1129     0.09788 ± 0.03598

Table 2: Effect of $c$ on the MSE, the $\lambda$ giving the best average MSE, the pattern recovery
error at the optimal MSE, and the best pattern recovery error possible. 50 training
points, 200 dimensions, 50 replications.

c                      MSE              $\lambda^*$   Model size*      Rec err*            Rec err min
0                      8.264 ± 5.187    0.04123       47.54 ± 7.149    0.2706 ± 0.06144    0.2661 ± 0.06096
1                      6.317 ± 4.809    0.0002441     61.3 ± 3.824     0.1957 ± 0.07468    0.1823 ± 0.08499
4                      2.428 ± 2.401    0.0002441     101.4 ± 13.74    0.2301 ± 0.04765    0.08716 ± 0.05194
6                      2.2 ± 2.404      0.0002441     111.9 ± 17.29    0.2572 ± 0.05094    0.06944 ± 0.03839
$d_s = \sqrt[4]{s}$    1.66 ± 1.593     0.0007401     141.2 ± 15.52    0.3366 ± 0.04511    0.0823 ± 0.05281
$d_s = 1$              3.707 ± 2.836    0.0002441     155.4 ± 14.44    0.3757 ± 0.0409     0.08228 ± 0.02283

Table 2 corresponds to a harder regime, with fewer training points and in higher
dimension. As in the first regime, the fourth and last columns show that the weighting
scheme has a significant influence on the variable selection behavior, with more uniform
schemes leading to more variables selected, and a better pattern recovery being achieved
for an intermediate scheme ($c = 6$). The reason for the optimal $c$ to be higher than in the
previous regime may be that in higher dimension with fewer training points, it is no longer
possible to recover the fine structure of the true pattern, and a better alternative is to select
a less precise but more stable selection of larger groups. In terms of MSE, the minimum is
reached for $d_s = \sqrt[4]{s}$, and for all the other weightings the optimum $\lambda$ is the last one in the
grid, for which a large fraction of the covariates have entered the model.

In the last regime (30 training points, 300 dimensions), Table 3 shows that the best 
pattern recovery is performed with uniform weights, which suggests that at this level of 
noise, using the fine structure of the groups is more harmful than helpful, and that the best 
choice is to only use the largest groups. The same reasoning applies to the MSE. 

Table 3: Effect of $c$ on the MSE, the $\lambda$ giving the best average MSE, the pattern recovery
error at the optimal MSE, and the best pattern recovery error possible. 30 training
points, 300 dimensions, 50 replications.

c                      MSE              $\lambda^*$   Model size*      Rec err*            Rec err min
0                      18.78 ± 7.021    1.32          15.74 ± 3.451    0.4059 ± 0.07167    0.396 ± 0.07169
1                      17.21 ± 6.763    0.5743        23.22 ± 3.501    0.3841 ± 0.06413    0.3693 ± 0.07547
4                      17.21 ± 8.195    0.125         51.5 ± 10.74     0.2281 ± 0.1294     0.2181 ± 0.1285
6                      14.74 ± 7.398    0.125         66.86 ± 17.36    0.2037 ± 0.1122     0.1996 ± 0.1198
$d_s = \sqrt[4]{s}$    11.81 ± 5.307    0.007812      119.8 ± 23.15    0.2259 ± 0.08258    0.1546 ± 0.1197
$d_s = 1$              11.82 ± 5.31     0.007812      159.2 ± 24.22    0.268 ± 0.0401      0.1284 ± 0.05387

10.4 Breast cancer data: pathway analysis

An important motivation for our method is the possibility to perform gene selection from
microarray data using priors which are overlapping groups. Genes are known to modify each
other's expression through various regulation mechanisms. More generally, some genes are
known to be involved in the same biological function, so the presence of a particular gene
in a predictive model can be indicative of the presence of related genes. In other words,
when we select one gene in our predictive model, we can expect that genes which are known
to either regulate or to be regulated by this gene, or more generally to be involved in the
same biological function, should also be selected. Since an increasing amount of information
on gene interaction is being gathered from empirical biological knowledge and organized in
databases (Subramanian et al., 2005), our hope is to use this information to:

Improve prediction accuracy: Functions involving a small number of pre-defined gene
sets form a smaller hypothesis set, in which we can hope to estimate better. Since
genes present in the same biological function are likely to be either all involved in the
studied phenomenon (disease outcome, subtype, response to a treatment) or all not
involved, we can expect to find a function predicting the phenomenon correctly in this
class.

Build accurate sparse prediction functions: Building sparse estimators has practical
implications in this context because it is technically easier to measure the expression
level of a small number of genes in a patient than a whole transcriptome. Selecting a
small number of gene sets is a more robust procedure than selecting a small number
of genes, because it is easy to spuriously select a gene from a noisy training set, whereas
the evidence adds up for a set of genes. In addition, selecting a few genes that belong
to the same functional groups could lead to increased interpretability of the signature.

To reach this goal we use our $\Omega^{\mathcal{G}}_{\cup}$ penalty with (overlapping) predefined gene sets as
groups. Several groupings of genes into gene sets are available in various databases. We
use the canonical pathways from MSigDB (Subramanian et al., 2005) containing 639 groups
of genes, 637 of which involve genes from our study. Among these, we restricted ourselves
to the 589 groups that contained fewer than 50 genes. Indeed, we observed empirically that
keeping very large pathways in the penalty leads to poor regularization, which makes sense
because the presence of very large groups allows the penalty to select a very large number
of covariates at a low cost, partially defeating the purpose of regularization. As discussed in
Section 8, it is possible to penalize large groups more heavily, but weighting cannot correct
extreme size discrepancies such as combinations of groups of size two and groups of size
100. In addition, we are interested in identifying a small number of well-defined biological
functions that predict the outcome. Selecting a large pathway which contains one third of
the genes would not be very informative.

We use the breast cancer dataset compiled by van de Vijver et al. (2002), which consists
of gene expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic and
217 non-metastatic). We restrict the analysis to the 2465 genes which are in at least one
pathway. Since the dataset is very unbalanced, we use a balanced logistic loss, weighting
each positive example by the proportion of negative examples and each negative example
by the proportion of positive examples.
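
The weighting scheme of the balanced logistic loss can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: labels are coded as +1 (taken here to be the metastatic class) and -1, the loss is summed rather than averaged, and there is no intercept. The function names are hypothetical.

```python
import numpy as np

def balanced_weights(y):
    """Per-example weights for labels y in {+1, -1}.

    Positives are weighted by the proportion of negatives and negatives
    by the proportion of positives, as described in the text.
    """
    y = np.asarray(y)
    frac_pos = np.mean(y == 1)
    return np.where(y == 1, 1.0 - frac_pos, frac_pos)

def balanced_logistic_loss(w, X, y):
    """Weighted logistic loss sum_i c_i * log(1 + exp(-y_i * x_i^T w))."""
    margins = y * (X @ w)
    # logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow
    return np.sum(balanced_weights(y) * np.logaddexp(0.0, -margins))

# Class sizes from the data set: 78 metastatic, 217 non-metastatic.
y = np.array([1] * 78 + [-1] * 217)
wts = balanced_weights(y)   # 217/295 for positives, 78/295 for negatives
```

With these weights, the two classes carry the same total weight (78 x 217/295 each), which is the sense in which the loss is balanced.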

We estimate by 5-fold cross validation the balanced accuracy (average of specificity and
sensitivity) of the balanced logistic regression with $\ell_1$ and $\Omega^{\mathcal{G}}_{\cup}$ penalties, using the pathways
as groups. As a pre-processing step, we keep the 500 genes most correlated with the output (on
each training set). This type of prefiltering is common practice with microarray data, and
all the results are quite robust to changes in the number of genes kept. $\lambda$ is selected by
internal cross validation on each training set.
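
The evaluation metric and the prefiltering step described above can be sketched as follows. The code is illustrative only: it assumes labels in {+1, -1} and measures "most correlated" by the absolute Pearson correlation, which the text does not specify. Both function names are hypothetical.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity and specificity for labels in {+1, -1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == -1] == -1)
    return 0.5 * (sensitivity + specificity)

def top_correlated(X, y, k=500):
    """Indices of the k columns of X most correlated (in absolute value) with y.

    To avoid leaking test information, this should be called on the
    training part of each fold only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # constant columns get correlation 0 instead of a division by zero
    corr = (Xc.T @ yc) / np.where(denom > 0, denom, np.inf)
    return np.argsort(-np.abs(corr))[:k]

acc = balanced_accuracy([1, 1, 1, 1, -1, -1], [1, 1, 1, -1, -1, 1])  # 0.625
```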

In our experiments on this very noisy dataset, we noticed that results changed a lot 
with the choice of the split, often more than between methods. In order to make sure that 
observed differences were actually caused by algorithms and not by particular choices of the 
5 foldings, we repeated each experiment on 5 choices of the 5 foldings, and show the result 
for each of these choices separately. 

Table 4 gives the balanced classification errors using $\Omega^{\mathcal{G}}_{\cup}$ with and without weights, and using
$\ell_1$. We observe a consistent improvement in performance when using $\Omega^{\mathcal{G}}_{\cup}$ rather than $\ell_1$
(between 2% and 12% depending on the folding). The weighted version of $\Omega^{\mathcal{G}}_{\cup}$ using $c = 4$ also
leads to a consistent improvement over $\ell_1$ but is outperformed by the unweighted version of
the penalty. Table 5 shows that the unweighted version of the penalty tends to select groups
that are larger than average, since the average size of the initial set of pathways (after the
preprocessing step that keeps only 500 genes) is 5 genes with a standard deviation of slightly
above 5. The weighted penalty corrects this bias: it leads to the selection of groups
of average size 5 but typically selects a much larger number of groups.

Table 6 shows the average number of genes involved in the model learned by each of the
methods. As expected, $\Omega^{\mathcal{G}}_{\cup}$ selects more genes, since it enforces sparsity at the gene set level
but does not enforce sparsity at the gene level. Note however that the number of involved
genes remains reasonable. As expected given the numbers of Table 5, the number of genes
selected in the model learned by the weighted version of $\Omega^{\mathcal{G}}_{\cup}$ is even larger.

Finally, we should mention, as a caveat, that the regularization coefficient was chosen
here to minimize the classification error, i.e., in a regime which typically overestimates the
support. A more tedious two-stage approach that removes the bias of the estimator
would probably lead to smaller supports, as suggested by the comparison of Rec err* and
Rec err min in Tables 1, 2 and 3.

Table 4: Balanced classification error for the $\ell_1$ and $\Omega^{\mathcal{G}}_{\cup}$ (with and without weights) on
average over 5 folds, for 5 different folding choices.

Method             $\Omega^{\mathcal{G}}_{\cup}$    Weighted $\Omega^{\mathcal{G}}_{\cup}$    $\ell_1$
Error folding 1    0.29 ± 0.05    0.35 ± 0.05    0.36 ± 0.04
Error folding 2    0.30 ± 0.08    0.39 ± 0.05    0.42 ± 0.04
Error folding 3    0.34 ± 0.14    0.34 ± 0.1     0.37 ± 0.10
Error folding 4    0.31 ± 0.11    0.33 ± 0.07    0.37 ± 0.08
Error folding 5    0.35 ± 0.05    0.35 ± 0.05    0.37 ± 0.05

Table 5: Number (and size) of involved pathways in the $\Omega^{\mathcal{G}}_{\cup}$ (with and without weights)
signatures on average over 5 folds, for 5 different folding choices.

Method       $\Omega^{\mathcal{G}}_{\cup}$                    Weighted $\Omega^{\mathcal{G}}_{\cup}$
Folding 1    6 ± 1.225 (16.73 ± 2.378)       45.8 ± 21.11 (5.35 ± 0.6635)
Folding 2    12.6 ± 7.765 (13.86 ± 3.589)    48.8 ± 23.13 (5.092 ± 0.4939)
Folding 3    7.6 ± 3.209 (14.86 ± 2.584)     43.8 ± 12.13 (5.147 ± 0.7176)
Folding 4    8.6 ± 7.266 (16.7 ± 4.477)      30.6 ± 17.3 (5.045 ± 0.7267)
Folding 5    8 ± 1 (14.82 ± 1.191)           48.4 ± 10.62 (5.347 ± 0.2867)

10.5 Breast cancer data: graph analysis

Another important application of microarray data analysis is the search for potential drug
targets. In order to identify genes which are related to a disease, one would like to find groups
of genes forming densely connected components on a graph carrying biological information
such as regulation, involvement in the same chain of metabolic reactions, or protein-protein
interaction. Similarly to what is done in pathway analysis, Chuang et al. (2007) built a
network by compiling several biological networks and performed such a graph analysis by
identifying discriminant subnetworks in one step and using these subnetworks to learn a
classifier in a separate step. We use this network and the approach described in Section 9,
treating all the edges on the network as groups of size two, on the breast cancer dataset.
Here again, we restrict the data to the 7910 genes which are present in the network, and
use the same correlation-based pre-processing as for the pathway analysis to reduce the set
to 500 genes.

Table 7 shows the prediction accuracy of the balanced logistic regression with $\ell_1$ and
$\Omega^{\mathcal{G}}_{\cup}$. Both methods yield almost exactly the same performance on average, suggesting that
this particular network is not a particularly informative prior for this learning problem.

Nonetheless, while $\ell_1$ mostly selects isolated variables on the graph, $\Omega^{\mathcal{G}}_{\cup}$ tends to select
variables which are clustered into larger connected components. Table 8 shows, for each
of the 5 foldings, the size of the largest connected component of the network restricted to
the selected genes (the average and standard deviations are computed over the 5 folds of
each folding). The average size of the largest connected component in the network after
preprocessing (i.e., keeping only 500 genes in each training set) is 68. One might suspect
that the increase of connectivity is merely caused by the fact that overall $\Omega^{\mathcal{G}}_{\cup}$ selects
more genes. While it is clear that selecting more genes makes it more likely to select

Table 6: Number of involved genes in the $\ell_1$ and $\Omega^{\mathcal{G}}_{\cup}$ (with and without weights) signatures
on average over 5 folds, for 5 different folding choices.

Method       $\Omega^{\mathcal{G}}_{\cup}$    Weighted $\Omega^{\mathcal{G}}_{\cup}$    $\ell_1$
Folding 1    98 ± 18        159.4 ± 60.1    41.2 ± 20.6
Folding 2    86.4 ± 18      143.4 ± 32      59.4 ± 22.5
Folding 3    125 ± 37.7     156.4 ± 36.7    59.4 ± 21.4
Folding 4    91.6 ± 25      115.2 ± 57.9    45.6 ± 28.4
Folding 5    98 ± 36        178.4 ± 33.9    56 ± 97

Table 7: Balanced classification error of the $\ell_1$ and $\Omega^{\mathcal{G}}_{\cup}$ (using the edges as the groups) on
the 5 folds.

Method J x 



Folding 1 0.3625 ± 0.04538 0.3367 ± 0.03788 

Folding 2 0.4142 ± 0.05885 0.4042 ± 0.06035 

Folding 3 0.3681 ± 0.04773 0.3782 ± 0.07497 

Folding 4 0.3749 ± 0.06476 0.3834 ± 0.06449 

Folding 5 0.3317 ± 0.04318 0.3443 ± 0.04414 



larger connected components, the last two columns of Table 8 suggest that the increased
connectivity is not simply caused by the selection of a larger number of genes. For example,
in folding 5, $\Omega^{\mathcal{G}}_{\cup}$ selects many more genes than $\ell_1$ but leads to the most modest increase
in connectivity, while in folding 4 the number of selected genes is practically the same,
although the $\Omega^{\mathcal{G}}_{\cup}$ estimate is still much more connected than that of $\ell_1$.

This gain of connectivity without loss of prediction accuracy could potentially make the 
interpretation of the classifier and the search for new drug targets easier in practice. 
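
The connectivity measure reported in Table 8, the size of the largest connected component of the network restricted to the selected genes, can be computed with a small union-find routine. The sketch below is a generic implementation, not the authors' code; the function name and the toy graph are hypothetical.

```python
def largest_component_size(edges, selected):
    """Size of the largest connected component of the graph induced on `selected`.

    `edges` is an iterable of (node, node) pairs, `selected` an iterable of nodes.
    Edges with an endpoint outside `selected` are ignored.
    """
    selected = set(selected)
    parent = {v: v for v in selected}

    def find(v):
        # follow parent pointers to the root, halving the path on the way
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        if a in selected and b in selected:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

    sizes = {}
    for v in selected:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)

# Toy network: a chain A-B-C-D and a separate edge E-F.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
size = largest_component_size(edges, {"A", "B", "D", "E", "F"})  # C is not selected
```

In the toy example, removing C splits the chain, so the largest component among the selected nodes has size 2.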

Table 8: Average size of the largest connected components and average number of genes
selected by the $\ell_1$ and $\Omega^{\mathcal{G}}_{\cup}$ (using the edges as the groups) for the 5 foldings.

Method       Largest CC $\Omega^{\mathcal{G}}_{\cup}$    Largest CC $\ell_1$    Genes $\Omega^{\mathcal{G}}_{\cup}$    Genes $\ell_1$
Folding 1    10.2 ± 5.586    1.8 ± 0.4472    75.4 ± 47.54    37.2 ± 17.68
Folding 2    6.2 ± 3.633     2 ± 0           58.4 ± 30.81    50 ± 9.301
Folding 3    8.6 ± 4.278     2 ± 0.7071      53.2 ± 8.012    43.2 ± 5.357
Folding 4    8 ± 6.205       2.2 ± 0.4472    48.6 ± 30.25    45.6 ± 20.63
Folding 5    6 ± 3.082       1.8 ± 0.4472    69 ± 31.2       37.2 ± 12.3

11. Conclusion

We have presented the latent group Lasso, a generalization of the group Lasso penalty
which leads to sparse models with sparsity patterns that are unions of pre-defined groups of
covariates, or, given a graph of covariates, groups of connected covariates in the graph. We
studied various properties of the penalty function, and gave both sufficient and necessary
conditions for group-support recovery, i.e., the correct recovery of the same union of groups
as in the decomposition induced by the penalty on the true optimal parameter vector.
We have highlighted the importance of setting weights correctly, and obtained promising
empirical results on both simulated and real data.

In future work it would be interesting to characterize further for which collections of
groups the latent group Lasso penalty and the estimators obtained by regularizing with it
are efficiently computable; which forms of structure can be encoded via such collections; and
what the appropriate choices of weights are in those cases, which will have to be determined
based on specific analyses of the consistency of these estimators under high-dimensional
scaling. Finally, more systematic comparisons with other group Lasso formulations, such
as that proposed by Jenatton et al. (2009), would be important.

Appendix A. Proofs of Lemmata 21 and 22

Lemmata 21 and 22 are about the continuity of the correspondences $w \mapsto A(w)$ and
$w \mapsto V(w)$. In order to prove them, we start by reviewing general results in correspondence
theory (Section A.1), notably Berge's maximum theorem, which is the main ingredient to
prove the two lemmas. We prove Lemma 21 directly in Section A.2. We then prove several
continuity properties of auxiliary correspondences in Sections A.3 and A.4 in order to finally
prove Lemma 22 in Section A.5.

A.1 Elements of correspondence theory

We start with a couple of useful technical lemmas from correspondence theory.

Lemma 37 If $f$ is a continuous function at $p$ and $\phi$ is a correspondence u.h.c. (resp.
l.h.c.) at $f(p)$, then $\phi \circ f$ is a correspondence u.h.c. (resp. l.h.c.) at $p$.
If $\phi : P \twoheadrightarrow X$ is a correspondence u.h.c. (resp. l.h.c.) at $p$ and $f$ is a continuous function
on $X$ then $f \circ \phi$ is a correspondence u.h.c. (resp. l.h.c.) at $p$.

Proof The proofs are straightforward from the definitions. ■ 



Lemma 38 An elementwise product of u.h.c. (resp. l.h.c.) correspondences is itself u.h.c.
(resp. l.h.c.).

Proof It is easy to check that a Cartesian product of l.h.c. (resp. u.h.c.) correspondences
has itself the same property. Moreover, the product is a continuous map, so the
result is proved by Lemma 37. ■



We now state without proof the celebrated maximum theorem (Berge, 1959).

Theorem 39 (Berge maximum theorem) Let $\phi : P \twoheadrightarrow X$ be a compact-valued
correspondence. Let $f : X \times P \to \mathbb{R}$ be a continuous real-valued function. Define the "argmax"
correspondence $\mu : P \twoheadrightarrow X$ by $\mu(p) = \{x \in \phi(p) \mid f(x, p) = \max_{x' \in \phi(p)} f(x', p)\}$. If $\phi$ is
continuous at $p$, then $\mu$ is non-empty, compact-valued and u.h.c. at $p$.

A.2 Proof of Lemma 21

Lemma 21 is a simple consequence of Theorem 39. Indeed, remember that, by definition,
$A(w) = \arg\max_{\alpha} \alpha^\top w$ s.t. $\Omega^*(\alpha) \leq 1$. Since $(\alpha, w) \mapsto \alpha^\top w$ is continuous and since the
correspondence $w \mapsto \{\alpha \in \mathbb{R}^p \mid \Omega^*(\alpha) \leq 1\}$ is compact-valued and continuous (it is
constant), Theorem 39 applies and shows that the correspondence $w \mapsto A(w)$ is u.h.c. (For
more general results on the continuity of the subdifferential viewed as a multi-function see
Hiriart-Urruty and Lemaréchal (1994, chap. VI.6.2, p. 282).) ■



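A.3 Continuity properties of $V(w)$, $\Lambda(w)$ and $Z(w)$

The fact that $w \mapsto V(w)$ is u.h.c. is also a direct consequence of Berge's maximum theorem.
We show this in the following two lemmata.

Lemma 40 The correspondence $\phi$ defined by

$$\phi(w) = \{v \in \mathcal{V}_{\mathcal{G}} \mid w = \textstyle\sum_g v^g,\ \mathrm{sign}(v_i^g) = \mathrm{sign}(w_i),\ 1 \leq i \leq p\} \qquad (29)$$

is a continuous correspondence.

Proof We have $\phi(w) = \prod_{i=1}^{p} \phi_i(w_i)$ with

$$\phi_i(w_i) = \{(v_i^g)_{g \in \mathcal{G}} \in \mathbb{R}^m \mid w_i = \textstyle\sum_g v_i^g,\ \mathrm{sign}(v_i^g) = \mathrm{sign}(w_i) \text{ for } g \ni i, \text{ and } v_i^g = 0 \text{ for } g \not\ni i\}.$$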

It is easy to verify that a Cartesian product of compact-valued continuous correspondences
is also continuous, so that we only need to show that $\phi_i$ is compact-valued and continuous.
We therefore focus on $\phi_i(w_i) \subset \mathbb{R}^m$. First note that $\phi_i$ is compact-valued because the sign
constraints in the definition of $\phi_i$ imply that for all $v_i = (v_i^g)_{g \in \mathcal{G}} \in \phi_i(w_i)$ we have $\|v_i\|_1 \leq |w_i|$.
We first show that $\phi_i$ is u.h.c. Let $U$ be an open set containing $\phi_i(w_i)$. For two sets
$A, B \subset \mathbb{R}^m$, we define $d_\infty(A, B) = \inf_{a \in A,\, b \in B} \|a - b\|_\infty$. Let $u_0 \in U^c$, $d_0 = d_\infty(\{u_0\}, \phi_i(w_i))$
and define $K = \{u \in \mathbb{R}^m \mid d_\infty(\{u\}, \phi_i(w_i)) \leq d_0\}$. By construction $K \cap U^c \neq \emptyset$, and
we have $d_\infty(U^c, \phi_i(w_i)) = d_\infty(U^c \cap K, \phi_i(w_i))$. Moreover, it is classical to show that the
compactness of $\phi_i(w_i)$ implies that $K$ is compact as well. Since $U^c \cap K$ and $\phi_i(w_i)$ are
compact sets, the infimum in the definition of $d_\infty$ is attained, which means that there are
$u^* \in U^c \cap K$ and $v^* \in \phi_i(w_i)$ such that $d_\infty(U^c \cap K, \phi_i(w_i)) = \|u^* - v^*\|_\infty$. But we
must have $\|u^* - v^*\|_\infty > 0$, otherwise $u^* = v^* \in U^c \cap \phi_i(w_i)$, which would contradict the
hypothesis that $\phi_i(w_i) \subset U$. If $\epsilon = \|u^* - v^*\|_\infty / 2$, we just showed that for all $\delta \in \mathbb{R}^m$ such
that $\|\delta\|_\infty \leq \epsilon$, $\phi_i(w_i) + \delta \subset U$.

If = 0, then any decomposition of ± £, say is such that Hv^loo < £ 5 and 
0(w^ ±s) (Z U. If 7^ 0, w.l.o.g. assume that > 0; consider a decomposition E W 71 

of + e r with \e r \ < min(s, |w^|/2); if < then = + e'ei is a decomposition of 
and — Vi||oo < s 7 ; if £ 7 > then it is easy to show that the projection of on the 
simplex 0(wj) satisfies ||v^ — v^oo < e' . In all cases 4>(wi + e f ) C U for some £ > 0, which 
shows that is u.h.c. 

We can show similarly that $\phi_i$ is l.h.c.: if $v_i \in U \cap \phi_i(w_i)$, then for some $\epsilon > 0$, $U$ contains
a closed ball of radius $\epsilon$ centered at $v_i$, which contains a decomposition of $w_i \pm \epsilon$ so that
$U \cap \phi_i(w_i \pm \epsilon) \neq \emptyset$. ■



Lemma 41 The correspondence $w \mapsto V(w)$ is compact-valued and u.h.c.

Proof Define $f(v, w) = \sum_{g} \|v^g\|$ and $\phi$ as in (29).

We have that $V(w) = \mathrm{Argmin}_{v \in \phi(w)} f(v, w)$ since it can be shown easily that any
optimal decomposition satisfies $\mathrm{sign}(v_i^g) = \mathrm{sign}(w_i)$.

Since the previous lemma shows that $\phi$ is a compact-valued continuous correspondence,
Theorem 39 applies and proves the result. ■

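Remember that $\Lambda(w) \subset \mathbb{R}^m$ is the set of solutions to (10). For a vector $\lambda \in \mathbb{R}^m$ we
consider the vector $\zeta(\lambda) \in \mathbb{R}^p$ defined by $\zeta_i(\lambda) = \sum_{g \ni i} \lambda_g$ and denote $Z(w) = \{\zeta(\lambda) \in
\mathbb{R}^p,\ \lambda \in \Lambda(w)\}$.

Lemma 42 $\Lambda(w)$ and $Z(w)$ are u.h.c. correspondences.

Proof Since $V$ is u.h.c., by Lemma 37, the continuity of $(v^g)_{g \in \mathcal{G}} \mapsto (\|v^g\|)_{g \in \mathcal{G}}$ shows that
$\Lambda(w)$ is u.h.c. and the continuity of $\lambda \mapsto (\sum_{g \ni i} \lambda_g)_{1 \leq i \leq p}$ shows that $Z(w)$ is u.h.c. ■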



Lemma 43 For all $i$ such that $w_i \neq 0$, $Z_i(w)$ is a singleton, and if we denote this unique
value by $\zeta_i(w)$ then the function $w' \mapsto \zeta_i(w')$ is uniquely defined in a neighborhood of $w$
and it is continuous at $w$.

Proof Uniqueness of $\zeta_i(w)$ at $w$ such that $w_i \neq 0$ is granted by the fact that if $w_i \neq 0$,
then $\alpha_i \neq 0$, $\alpha_i$ is unique (cf. Lemma 9) and the proof of Lemma 6 shows that $\zeta_i = w_i / \alpha_i$. Thus,
$\zeta_i(w)$ is unique, but so is $\zeta_i(w')$ for $w'$ in a small neighborhood of $w$ since $w'_i \neq 0$.

Moreover we have $\zeta_i(w) = \sum_{g \ni i} \lambda_g$ for any $\lambda \in \Lambda(w)$. Finally, the upper hemicontinuity
of $w \mapsto Z_i(w)$ shown in the previous lemma implies the continuity of $\zeta_i$. ■

Lemma 44 Let S = {u E R p \ supp (u) C J\}. Consider w such that Vz E Ji and for all u
m a neighborhood of in S, Z^(w + u) is a singleton, then ifHg 1 denotes the projection on 
{A G R m | Xgc = 0} we have that 

A\ G j\ :S -» M) Ql1 

w' Ug 1 A(w') 

is a lower hemicontinuous correspondence at w. 

Proof Let B E RP Xm the adjacency matrix associated to £?, denned by = 1 if i E g 
and else. To simplify notations we denote B = Bj 1 ^ 1 the submatrix obtained by keeping 
rows in J\ and columns in Gi, C — Cj^w') an d A — n^A^ 7 ). Given £, then A = {Ag 
R+ | £ = BA} which means that if B + denotes the Moore-Penrose pseudo-inverse of B 
then A = (B+C + /Cer(B)) fl Rf . 

We now show that this correspondence is l.h.c. The uniqueness of £ implies its conti- 
nuity, since by lemma 42, Zi(w) is u.h.c. Denoting by H a matrix whose columns form a 
basis of /Cer(B), h 9 and b 9 the g th row of H and B + respectively, then an element of A 
is of the form (h 9 C + h^q) ge g 1 for some q. Given an element B + £ + Hq E U D R+ 1 ', we 

show that there exists an element A(w + u, q 7 ) = B + £(w + u) + Hq 7 E U fl R^ 1 ' for u in 
neighborhood of in S. Without loss of generality we can take U a cartesian product of 
open sets U = ® geGl U g . 

Let Q = {q 7 | B+C(w) + Hq 7 E R^ 1 '}. For all g E there exists qte) E Q such 
that b^C + hPq^) > 0. Set q 7 = (1 - e)q + ^ E^&q^. For e sufficiently small, 

A p (w, q 7 ) E E7p H R+, for all g E £i so that for u sufficiently small A^(w + u, q 7 ) E U g fl R+ 
as well. For all g tfi Gi, A g (w) = {0} and since A is u.h.c, for any 77 > 0, for u sufficiently 
small we have A^(w + u) C [0, 77), g £ Gi- Choosing 77 such that Mg £ Gi, [0, 77) C C/^ shows 
the result. ■ 



A. 4 Continuity properties of Gi and Gi 

Lemma 45 There exists a neighborhood U of in MP such that for all u E U with 
supp(u) C Ji(w), <?i(w + u) C <?i(w). 

Proof By definition of £/i(w + u), if g E £/i(w + u), then ct g (w + u) is unique by 
lemma 15, since g C Ji(w + u). For any # E £?i(w + u), g fl Ji(w) ^ 0; indeed 
if p fl Ji(w) = 0, then = = 0. If g C Ji(w), a 5 (w) is unique and since 
ct g (w + u) is unique, the upper hemicontinuity of A implies that oc g is continuous at 
w so that (||a^(w + u)|| = 1 => ||a^(w)|| = 1). If g\Ji(w) ^ 0, then it has to be the 
case that ^\Jx(w)( w + u) = 0, because it is indeed a possible value for a^j^w + u) 
(given that w^\j x ( w ) = u^\j x ( w ) = 0) and because a 5 (w + u) is unique. This implies 
that ||(cKpnJi(w)( w + u )ll — 1 an( i since a^nJi(w)( w ) is unique, upper hemicontinuity of 
A implies that w 7 \-> Q^nJi(w)( w/ ) is continuous at w so that we have by continuity
||c^(w)|| > ||«pnJi(w)( w )ll — 1 which proves that ||a^(w)|| = 1; but this is a contra-
diction because this would imply g E Gi and therefore g C J\. ■ 

Lemma 46 Let V Jx = {u e W | ||u|| < 1, uj f = 0}; t/ien 

^i( w ) = H U &(w + eu). 

Proof One inclusion is already shown by the previous Lemma 45. For the other inclusion, 
let v be an optimal decomposition of w and ct the unique element of A(w) such that oljc = 
0. Let = ||v^||- The case of g E ^i(w) is straightforward, and we concentrate therefore 
on g E (?i(w)\£/i(w). By lemma 9, we have w — J2 ge g 1 ^g&g- Consider W( go , e ) — w + e ct go 
for some go E (?i(w)\£tl(w). By construction, ol E Vj x and for all j3 E W such that 
n*({3) < 1 we have 

W {go,e)P = J2 X 9 (X l^9 + 6 a Jo < A 2 + e = W (^o,e)« 

which shows that v' defined by v^ Q = ea 3o and w f g — v^, g 7^ go is an optimal decom- 
position of W(p 0j€ ) with group-support ^i(w) U go- Since this is true for any e and any 
go £ (5i(w)\^i(w), this proves the statement. ■ 



A.5 Proof of Lemma 22

We know from Lemma 41 that $w \mapsto V(w)$ is a compact-valued u.h.c. correspondence. If
$\mathrm{supp}(w) = J_1$ then Lemma 43 implies that for all $i \in J_1$, $\zeta_i(w + u)$ is unique for all $u$ in a
neighborhood of 0. From Lemma 44, this implies that $u \mapsto \Pi_{\mathcal{G}_1} \Lambda(w + u)$ is l.h.c. at $u = 0$.
This extends to $u \mapsto \Lambda(w + u)$ since we know from Lemma 45 that there exists a
neighborhood of zero such that, for all $u$ in that neighborhood, $\Pi_{\mathcal{G}_1^c} \Lambda(w + u) = 0$. Given that
$V(w + u) = \alpha(w + u) \Lambda(w + u)$, since $\alpha(w)$ is l.h.c. from Lemma 21 and since a product
of l.h.c. correspondences is l.h.c. (cf. Lemma 38), we have shown that $u \mapsto V(w + u)$ is also
l.h.c. at $u = 0$. ■



Appendix B. Partial group-support recovery 

Theorem 23, which only assumes hypothesis (HI), does not give a lower bound (in the sense 
of inclusion) for ^i(w), suggesting that hypothesis (H2) is necessary to guarantee group- 
support recovery. In this section, we first consider an example in which G±(w) is strictly 
included in £yi(w*). 

Example with partial recovery. Take $\mathcal{G} = \{\{0, 1, 2\}, \{0, 1, 3\}, \{0, 2, 3\}\}$ for $w =
(w_0, w_1, w_2, w_3) \in \mathbb{R}^4$. It is easy to check that $\lambda_{\{0,1,2\}} = \gamma(|w_1| + |w_2| - |w_3|)_+$, $\lambda_{\{0,1,3\}} =
\gamma(|w_1| + |w_3| - |w_2|)_+$ and $\lambda_{\{0,2,3\}} = \gamma(|w_2| + |w_3| - |w_1|)_+$ with $\gamma$ determined by the equation
Y^=o vir = 1- In particular if we consider w* = (1, 0, 0, 0), then taking the identity as the
design matrix and assuming independent Gaussian noise, we have $y = (1 + \epsilon_0, \epsilon_1, \epsilon_2, \epsilon_3)$
with $\epsilon_i$ i.i.d. $\mathcal{N}(0, \sigma^2)$. Thus solving the first order approximation of the KKT in the
neighborhood of w* we get w = ((1 + 60 — A)+, 61, 62, 63). We have £tl(w*) = £7i(w*) = Q 
but for any value of a 2 , with probability and 1 — 3/x, ^i(w) takes respectively the 

values £\{0, 1, 2}, £\{0, 1, 3},5\{0, 2, 3} and Q, with /i « 0.216. 
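
The value 0.216 quoted in this example can be checked by simulation. The sketch below rests on an assumption drawn from the positive parts in the expressions for the $\lambda$'s: a group drops out of the group-support exactly when the noise coordinate it does not contain exceeds, in absolute value, the sum of the absolute values of the two other noise coordinates. The sample size and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# |eps_1|, |eps_2|, |eps_3| for 200000 independent draws; the common scale
# sigma does not matter because the events compared are scale invariant.
eps = np.abs(rng.standard_normal((200_000, 3)))

total = eps.sum(axis=1, keepdims=True)
# dominated[n, k] is True when |eps_k| > sum of the two other |eps_j|,
# i.e. when the positive part attached to the group missing k vanishes.
dominated = eps > total - eps

mu_hat = dominated.mean(axis=0)                 # three estimates of mu
p_all = 1.0 - dominated.any(axis=1).mean()      # all three groups active
print(mu_hat, p_all)
```

At most one coordinate can dominate in a given draw, so the three events are disjoint and the probability that all three groups remain active is one minus three times the common probability, in agreement with the text.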

However, the following lemma shows that the group-support recovered contains at least
the group-support of one of the decompositions of the true support.

Lemma 47 If $w_n$ is a sequence converging to $w$, then denoting $\mathrm{gsupp}(v)$ the group support
of a decomposition $v$, we have

$$\exists n_0,\ \forall n > n_0,\ \forall v_n \in V(w_n),\ \exists v \in V(w),\quad \mathrm{gsupp}(v) \subset \mathrm{gsupp}(v_n).$$

Proof Reason by contradiction and assume that

$$\forall n_0,\ \exists n > n_0,\ \exists v_n \in V(w_n),\ \forall v \in V(w),\quad \mathrm{gsupp}(v) \not\subset \mathrm{gsupp}(v_n).$$

We can therefore extract a subsequence $(w_{\varphi(n)})_n$ with this property and the corresponding
subsequence $(v_{\varphi(n)})_n$ illustrating it. There exists at least one $\mathcal{G}_0 \in 2^{\mathcal{G}}$ such that there
are infinitely many elements $v_{\varphi(n)}$ in the subsequence which satisfy $\mathrm{gsupp}(v_{\varphi(n)}) = \mathcal{G}_0$.
We consider the subsequence $(v_{\varphi'(n)})_n$ composed of those elements. From the sequence
$(v_{\varphi'(n)})_n$, since we can assume without loss of generality that it lives in the compact set $\{v \mid
\sum_g \|v^g\| \leq 2\|w\|\}$, we can extract a converging subsequence $(v_{\varphi''(n)})_n$. Since $(w_{\varphi''(n)})_n$
converges to $w$ and by upper hemicontinuity of $V(\cdot)$, the subsequence $(v_{\varphi''(n)})_n$ converges
to an optimal decomposition $v_\infty$ of $w$. This implies that $\mathrm{gsupp}(v_\infty) \subset \mathcal{G}_0 = \mathrm{gsupp}(v_{\varphi''(n)})$,
which is a contradiction. ■

The simpler example with $\mathcal{G} = \{\{1, 2\}, \{2, 3\}\}$ and $w^* = (0, 1, 0)$ could be expected
to be problematic since $(0, 1, \epsilon)$ and $(\epsilon, 1, 0)$ have respectively group-support $\{\{2, 3\}\}$ and
$\{\{1, 2\}\}$. However, this case is consistent since it can be shown that $w_1$ and $w_3$ are almost
surely non-zero, which implies that both groups are part of the group-support.

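Appendix C. Derivations for the illustrative examples

C.1 Graph Lasso for the cycle of length 3

We consider the overlap norm in $\mathbb{R}^3$ with groups $\mathcal{G} = \{\{1, 2\}, \{1, 3\}, \{2, 3\}\}$. If $\alpha$ denotes a
dual variable, the dual norm takes the form:

$$\Omega^*(\alpha) = \max\big(\|(\alpha_1, \alpha_2)\|,\ \|(\alpha_1, \alpha_3)\|,\ \|(\alpha_2, \alpha_3)\|\big).$$

By Fenchel duality, $\Omega(w) = \max_{\alpha} \alpha^\top w$ s.t. $\max_{g \in \mathcal{G}} \|\alpha_g\|^2 \leq 1$. Consider the Lagrangian

$$L^*(\alpha, \lambda, w) = -(\alpha_1 w_1 + \alpha_2 w_2 + \alpha_3 w_3)
+ \tfrac{1}{2}\big[(\lambda_{12} + \lambda_{13})\,\alpha_1^2 + (\lambda_{12} + \lambda_{23})\,\alpha_2^2 + (\lambda_{13} + \lambda_{23})\,\alpha_3^2 - (\lambda_{12} + \lambda_{13} + \lambda_{23})\big]$$

and consider the optimization problem $\min L^*(\alpha, \lambda, w)$ s.t. $\lambda_g \geq 0$, $g \in \mathcal{G}$.
A singular point of the Lagrangian satisfies

$$w_1 = (\lambda_{12} + \lambda_{13})\,\alpha_1, \quad w_2 = (\lambda_{12} + \lambda_{23})\,\alpha_2, \quad w_3 = (\lambda_{13} + \lambda_{23})\,\alpha_3. \qquad (30)$$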

C.1.1 At most two groups are active

Assume that $\lambda_{13} = 0$. Note that this case reduces to the case of $\mathcal{G} = \{\{1,2\},\{2,3\}\}$, which
is of interest on its own. Eq. (30) simplifies and the singular points of the Lagrangian solve

$$w_1 = \lambda_{12}\,\alpha_1, \qquad w_2 = (\lambda_{12}+\lambda_{23})\,\alpha_2, \qquad w_3 = \lambda_{23}\,\alpha_3. \tag{31}$$

We assume first that $\lambda_{12} > 0$, $\lambda_{23} > 0$, $|w_1| > 0$, $|w_3| > 0$. Since, by complementary
slackness, $\|\alpha_{12}\| = 1$ and $\|\alpha_{23}\| = 1$, using (30), we have

$$\frac{w_1^2}{\lambda_{12}^2} + \frac{w_2^2}{(\lambda_{12}+\lambda_{23})^2} = 1 \quad\text{and}\quad \frac{w_2^2}{(\lambda_{12}+\lambda_{23})^2} + \frac{w_3^2}{\lambda_{23}^2} = 1. \tag{32}$$

So that $\frac{w_1^2}{\lambda_{12}^2} = \frac{w_3^2}{\lambda_{23}^2}$, or equivalently $\lambda_{23} = \frac{|w_3|}{|w_1|}\lambda_{12}$, and by substitution in (32) we get respectively:

$$\lambda_{12} = \frac{|w_1|}{|w_1|+|w_3|}\,\big\|(w_2,\ |w_1|+|w_3|)\big\| \quad\text{and}\quad \lambda_{23} = \frac{|w_3|}{|w_1|+|w_3|}\,\big\|(w_2,\ |w_1|+|w_3|)\big\|.$$

Substituting these expressions for $\lambda_{12}$ and $\lambda_{23}$ in the singular point equations (31), we get:

$$\alpha_1 = \mathrm{sign}(w_1)\,\frac{|w_1|+|w_3|}{\|(w_2,\ |w_1|+|w_3|)\|} \quad\text{and}\quad \alpha_2 = \frac{w_2}{\|(w_2,\ |w_1|+|w_3|)\|}. \tag{33}$$

$\alpha_3$ has a similar expression as $\alpha_1$, where the roles of $w_3$ and $w_1$ are exchanged. Finally, the
decomposition is:

$$v^{12} = \Big(w_1,\ \frac{|w_1|}{|w_1|+|w_3|}\,w_2\Big) \quad\text{and}\quad v^{23} = \Big(\frac{|w_3|}{|w_1|+|w_3|}\,w_2,\ w_3\Big), \tag{34}$$

and the norm then takes the closed form $\Omega(w) = \|(w_2,\ |w_1|+|w_3|)\|$. It remains to consider
the cases where $w_1 = 0$ or $w_3 = 0$, which we do not develop here.
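The closed forms of this subsection can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function name `chain_decomposition` and the test vector are illustrative assumptions, and the code only covers the case $w_1 \neq 0$, $w_3 \neq 0$ treated above.

```python
import math

def chain_decomposition(w1, w2, w3):
    """Closed forms for groups {1,2} and {2,3}; assumes w1 != 0 and w3 != 0.

    Hypothetical helper written for illustration only.
    """
    s = abs(w1) + abs(w3)
    norm = math.hypot(w2, s)           # Omega(w) = ||(w2, |w1| + |w3|)||
    lam12 = abs(w1) / s * norm         # multiplier of group {1,2}
    lam23 = abs(w3) / s * norm         # multiplier of group {2,3}
    v12 = (w1, abs(w1) / s * w2)       # latent vector on coordinates (1, 2), Eq. (34)
    v23 = (abs(w3) / s * w2, w3)       # latent vector on coordinates (2, 3), Eq. (34)
    return norm, (lam12, lam23), v12, v23

w1, w2, w3 = 1.0, -2.0, 3.0            # arbitrary example vector
norm, (lam12, lam23), v12, v23 = chain_decomposition(w1, w2, w3)

# Dual variable from the singular point equations (31).
alpha = (w1 / lam12, w2 / (lam12 + lam23), w3 / lam23)
```

On this example the two latent vectors sum back to $w$, their Euclidean norms add up to $\Omega(w)$, and both dual constraints $\|\alpha_{12}\| = \|\alpha_{23}\| = 1$ are tight, as complementary slackness requires.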

C.1.2 All groups are active 

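We first consider the case $\lambda_{12} > 0$, $\lambda_{13} > 0$, $\lambda_{23} > 0$. By complementary slackness we have
$\|\alpha_g\| = 1$, $g \in \mathcal{G}$. Introducing $\zeta_1 = \lambda_{12}+\lambda_{13}$, $\zeta_2 = \lambda_{12}+\lambda_{23}$ and $\zeta_3 = \lambda_{13}+\lambda_{23}$, (30) rewrites
as

$$\frac{w_1^2}{\zeta_1^2} + \frac{w_2^2}{\zeta_2^2} = 1, \qquad \frac{w_2^2}{\zeta_2^2} + \frac{w_3^2}{\zeta_3^2} = 1, \qquad \frac{w_1^2}{\zeta_1^2} + \frac{w_3^2}{\zeta_3^2} = 1,$$

which taking pairwise differences yields:

$$\frac{1}{\gamma^2} \triangleq \frac{w_1^2}{\zeta_1^2} = \frac{w_2^2}{\zeta_2^2} = \frac{w_3^2}{\zeta_3^2}. \tag{35}$$

Or in other words:

$$\gamma \begin{pmatrix} |w_1| \\ |w_2| \\ |w_3| \end{pmatrix} = \begin{pmatrix} \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} \lambda_{12} \\ \lambda_{13} \\ \lambda_{23} \end{pmatrix},$$

which yields

$$\lambda_{12} = \frac{\gamma}{2}\big(|w_1|+|w_2|-|w_3|\big), \qquad \lambda_{13} = \frac{\gamma}{2}\big(|w_1|+|w_3|-|w_2|\big), \qquad \lambda_{23} = \frac{\gamma}{2}\big(|w_2|+|w_3|-|w_1|\big).$$

But since we have assumed $\lambda_g > 0$, the solution found is only valid if no coordinate dominates,
in the sense that $w \in \mathcal{W}_{\mathrm{bal}}$ with

$$\mathcal{W}_{\mathrm{bal}} = \big\{w \in \mathbb{R}^3 \;\big|\; |w_1| < |w_2|+|w_3|,\ |w_2| < |w_1|+|w_3|,\ |w_3| < |w_1|+|w_2|\big\}.$$

By re-substituting (35) in (30), we can solve for $\gamma$ and find that

$$\alpha = \frac{1}{\sqrt{2}}\,\mathrm{sign}(w) \quad\text{and thus}\quad \Omega(w) = \frac{1}{\sqrt{2}}\,\|w\|_1.$$

The unit ball of the norm therefore has some flat faces. Finally, since $(v^g)_g$ is an optimal
decomposition of $w$, we have $v^g = \lambda_g \alpha_g$; the decomposition is unique and can be written

$$v^{\{12\}} = \frac{1}{2}\begin{pmatrix} w_1 + (|w_2|-|w_3|)\,\mathrm{sign}(w_1) \\ w_2 + (|w_1|-|w_3|)\,\mathrm{sign}(w_2) \end{pmatrix}, \qquad v^{\{13\}} = \frac{1}{2}\begin{pmatrix} w_1 + (|w_3|-|w_2|)\,\mathrm{sign}(w_1) \\ w_3 + (|w_1|-|w_2|)\,\mathrm{sign}(w_3) \end{pmatrix},$$

and

$$v^{\{23\}} = \frac{1}{2}\begin{pmatrix} w_2 + (|w_3|-|w_1|)\,\mathrm{sign}(w_2) \\ w_3 + (|w_2|-|w_1|)\,\mathrm{sign}(w_3) \end{pmatrix}.$$

If $w \notin \mathcal{W}_{\mathrm{bal}}$, then one of $\lambda_{12}$, $\lambda_{13}$ or $\lambda_{23}$ equals 0, and this reduces to the situation where
only two groups are active, which we considered in section C.1.1 above.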
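As a sanity check of the balanced case, the sketch below evaluates the three latent vectors for one vector in the balanced set and verifies that they sum to $w$ and that their norms add up to $\|w\|_1/\sqrt{2}$. The helper names `sign` and `balanced_decomposition` and the example vector are assumptions made for illustration.

```python
import math

def sign(x):
    """Sign of a real number as -1, 0 or 1."""
    return (x > 0) - (x < 0)

def balanced_decomposition(w):
    """Latent vectors for the 3-cycle when no coordinate dominates.

    Hypothetical helper: each entry equals lambda_g * alpha_i, where the
    factor gamma = sqrt(2) has already cancelled out.
    """
    a1, a2, a3 = (abs(x) for x in w)
    if not (a1 < a2 + a3 and a2 < a1 + a3 and a3 < a1 + a2):
        raise ValueError("w must lie in the balanced set")
    s1, s2, s3 = (sign(x) for x in w)
    v12 = (0.5 * (a1 + a2 - a3) * s1, 0.5 * (a1 + a2 - a3) * s2)
    v13 = (0.5 * (a1 + a3 - a2) * s1, 0.5 * (a1 + a3 - a2) * s3)
    v23 = (0.5 * (a2 + a3 - a1) * s2, 0.5 * (a2 + a3 - a1) * s3)
    return v12, v13, v23

w = (2.0, -3.0, 4.0)                   # arbitrary balanced example
v12, v13, v23 = balanced_decomposition(w)

# Coordinate i of w is the sum of the latent entries that cover i.
recon = (v12[0] + v13[0], v12[1] + v23[0], v13[1] + v23[1])
total = sum(math.hypot(*v) for v in (v12, v13, v23))
```

Here `recon` reproduces `w` and `total` matches $\|w\|_1/\sqrt{2}$, consistent with the expressions above.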

C.1.3 Closed form expression for the norm 

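Finally, summarizing the analysis, we obtain the closed form expression:

$$\Omega(w) = \begin{cases} \dfrac{1}{\sqrt{2}}\,\|w\|_1 & \text{if } w \in \mathcal{W}_{\mathrm{bal}}, \\[2ex] \min\big\{\, \|(w_1,\ |w_2|+|w_3|)\|,\ \|(w_2,\ |w_1|+|w_3|)\|,\ \|(w_3,\ |w_1|+|w_2|)\| \,\big\} & \text{else.} \end{cases}$$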
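The closed form for the cycle of length 3 is short enough to transcribe directly. The sketch below is an illustrative transcription (the name `omega_cycle3` is an assumption, not the authors' code) that can be used to probe the two regimes and the boundary between them.

```python
import math

def omega_cycle3(w):
    """Closed-form norm for groups {1,2}, {1,3}, {2,3} (illustrative sketch)."""
    a1, a2, a3 = (abs(x) for x in w)
    balanced = a1 < a2 + a3 and a2 < a1 + a3 and a3 < a1 + a2
    if balanced:
        # All three groups are active: scaled l1 norm.
        return (a1 + a2 + a3) / math.sqrt(2)
    # One coordinate dominates: best of the three two-group solutions.
    return min(math.hypot(a1, a2 + a3),
               math.hypot(a2, a1 + a3),
               math.hypot(a3, a1 + a2))

inside = omega_cycle3((1.0, 1.0, 1.0))     # balanced regime
outside = omega_cycle3((3.0, 1.0, 1.0))    # first coordinate dominates
boundary = omega_cycle3((2.0, 1.0, 1.0))   # |w1| = |w2| + |w3|
```

At the boundary point both branches give the same value, $4/\sqrt{2} = \|(2,2)\|$, which illustrates that the two expressions glue together continuously.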



C.2 Graph Lasso for the cycle of length 4 

We consider here the case where the groups are $\mathcal{G} = \{\{1,2\},\{1,3\},\{2,4\},\{3,4\}\}$. This case
is interesting because we will show that non-sparse $w$ on the cycle always admit several
optimal decompositions. The dual norm takes the form:

$$\Omega^*(\alpha) = \max\big(\|(\alpha_1,\alpha_2)\|,\ \|(\alpha_1,\alpha_3)\|,\ \|(\alpha_2,\alpha_4)\|,\ \|(\alpha_3,\alpha_4)\|\big).$$

We use again Fenchel duality, write $\Omega(w) = \max_\alpha \alpha^\top w$ s.t. $\Omega^*(\alpha) \le 1$, and we construct
the Lagrangian:

$$L^*(\alpha,\lambda,w) = -(\alpha_1 w_1 + \alpha_2 w_2 + \alpha_3 w_3 + \alpha_4 w_4) + \frac{1}{2}\Big[\zeta_1 \alpha_1^2 + \zeta_2 \alpha_2^2 + \zeta_3 \alpha_3^2 + \zeta_4 \alpha_4^2 - (\lambda_{12}+\lambda_{13}+\lambda_{24}+\lambda_{34})\Big]$$

with $\zeta_1 = \lambda_{12}+\lambda_{13}$, $\zeta_2 = \lambda_{12}+\lambda_{24}$, $\zeta_3 = \lambda_{13}+\lambda_{34}$ and $\zeta_4 = \lambda_{24}+\lambda_{34}$. A singular point of
the Lagrangian satisfies $w_i = \zeta_i \alpha_i$, $1 \le i \le 4$.

C.2.1 All groups are active

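We first consider the case $\lambda_{12}, \lambda_{13}, \lambda_{24}, \lambda_{34} > 0$. By complementary slackness

$$\|\alpha_g\| = 1, \quad g \in \mathcal{G}, \tag{CS}$$

which, using the singular point equations $w_i = \zeta_i \alpha_i$, rewrites as

$$\frac{w_1^2}{\zeta_1^2} + \frac{w_2^2}{\zeta_2^2} = 1, \qquad \frac{w_1^2}{\zeta_1^2} + \frac{w_3^2}{\zeta_3^2} = 1, \qquad \frac{w_2^2}{\zeta_2^2} + \frac{w_4^2}{\zeta_4^2} = 1 \quad\text{and}\quad \frac{w_3^2}{\zeta_3^2} + \frac{w_4^2}{\zeta_4^2} = 1. \tag{36}$$

Taking differences between pairs of equations above that share a common variable $w_i$,
we get

$$\begin{cases} |w_1|\,(\lambda_{24}+\lambda_{34}) = |w_4|\,(\lambda_{12}+\lambda_{13}), \\ |w_2|\,(\lambda_{13}+\lambda_{34}) = |w_3|\,(\lambda_{12}+\lambda_{24}). \end{cases}$$

Thus, isolating $\lambda_{12}$ in both equations and eliminating it yields

$$\frac{|w_1|}{|w_4|}\,(\lambda_{24}+\lambda_{34}) - \lambda_{13} = \frac{|w_2|}{|w_3|}\,(\lambda_{13}+\lambda_{34}) - \lambda_{24}.$$

Now isolating $\lambda_{13}$ we get

$$\lambda_{13} = \Big(1 + \frac{|w_2|}{|w_3|}\Big)^{-1} \Big(\frac{|w_1|}{|w_4|}\,(\lambda_{24}+\lambda_{34}) + \lambda_{24} - \frac{|w_2|}{|w_3|}\,\lambda_{34}\Big).$$

Adding $\lambda_{34}$ on both sides yields

$$\lambda_{13} + \lambda_{34} = \Big(1 + \frac{|w_2|}{|w_3|}\Big)^{-1} \Big(1 + \frac{|w_1|}{|w_4|}\Big)\,(\lambda_{24}+\lambda_{34}).$$

Inserting this expression into the only equation of (36) which doesn't contain $\lambda_{12}$ we get

$$\frac{w_3^2\,\big(1 + \frac{|w_2|}{|w_3|}\big)^2}{\big(1 + \frac{|w_1|}{|w_4|}\big)^2\,(\lambda_{24}+\lambda_{34})^2} + \frac{w_4^2}{(\lambda_{24}+\lambda_{34})^2} = 1,$$

which reduces to

$$\zeta_4 = \lambda_{24}+\lambda_{34} = \frac{|w_4|}{|w_1|+|w_4|}\,\Big[\big(|w_2|+|w_3|\big)^2 + \big(|w_1|+|w_4|\big)^2\Big]^{1/2}. \tag{37}$$

By symmetry, we get similar expressions for $\lambda_{12}+\lambda_{13}$, $\lambda_{12}+\lambda_{24}$, and $\lambda_{13}+\lambda_{34}$. Since
$\Omega(w) = \lambda_{12}+\lambda_{13}+\lambda_{24}+\lambda_{34}$, we get immediately that

$$\Omega(w) = \Big[\big(|w_2|+|w_3|\big)^2 + \big(|w_1|+|w_4|\big)^2\Big]^{1/2} = \big\|\big(|w_1|+|w_4|,\ |w_2|+|w_3|\big)\big\|.$$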
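The expressions for the cycle of length 4 can be checked against the complementary slackness equations (36). The sketch below is illustrative only: the name `cycle4_zeta` is an assumption, the expressions for $\zeta_1$, $\zeta_2$, $\zeta_3$ are written "by symmetry" with (37) as stated in the text, and all coordinates of $w$ are assumed non-zero.

```python
import math

def cycle4_zeta(w):
    """Norm and zeta values for groups {1,2}, {1,3}, {2,4}, {3,4}.

    Illustrative sketch; assumes every coordinate of w is non-zero.
    """
    a1, a2, a3, a4 = (abs(x) for x in w)
    omega = math.hypot(a1 + a4, a2 + a3)       # Omega(w)
    zeta = (a1 / (a1 + a4) * omega,            # zeta_1 = lambda_12 + lambda_13
            a2 / (a2 + a3) * omega,            # zeta_2 = lambda_12 + lambda_24
            a3 / (a2 + a3) * omega,            # zeta_3 = lambda_13 + lambda_34
            a4 / (a1 + a4) * omega)            # zeta_4 = lambda_24 + lambda_34, Eq. (37)
    return omega, zeta

w = (1.0, 2.0, 3.0, 4.0)                       # arbitrary non-sparse example
omega, zeta = cycle4_zeta(w)

# Left-hand sides of the four equations in (36), one per group.
groups = [(0, 1), (0, 2), (1, 3), (2, 3)]
lhs = [w[i] ** 2 / zeta[i] ** 2 + w[j] ** 2 / zeta[j] ** 2 for i, j in groups]
```

For this example every entry of `lhs` equals 1 up to rounding, and $\zeta_1 + \zeta_4 = \zeta_2 + \zeta_3 = \Omega(w)$.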



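The above derivation gave us values for $\zeta_1, \zeta_2, \zeta_3, \zeta_4$. We discuss now the existence and the
uniqueness of the $(\lambda_g)_{g \in \mathcal{G}}$. Given the vectors $\zeta \in \mathbb{R}^4$ and $\lambda \in \mathbb{R}^4$ we have $\zeta = B\lambda$ where $B$ is
the incidence matrix of the groups, with $B_{ig} = 1_{\{i \in g\}}$. To be precise we have

$$\begin{pmatrix} \zeta_1 \\ \zeta_2 \\ \zeta_3 \\ \zeta_4 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} \lambda_{12} \\ \lambda_{13} \\ \lambda_{24} \\ \lambda_{34} \end{pmatrix}.$$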



Clearly, in this case, $B$ is not invertible, and the kernel of $B$ is the span of $(-1, 1, 1, -1)^\top$.
Since the matrix is symmetric, $\mathrm{Ker}(B) = \mathrm{Im}(B)^\perp$, and since $\zeta_1 + \zeta_4 = \Omega(w) = \zeta_2 + \zeta_3$, we
have $\zeta_1 - \zeta_2 - \zeta_3 + \zeta_4 = 0$, so that $\zeta \in \mathrm{Im}(B)$. The vector $\lambda$ exists provided the pre-image of $\zeta$ has a non-empty
intersection with the positive orthant. Moreover, if all $\lambda_g$ are positive then the solution is
not unique. The Moore-Penrose pseudo-inverse of $B$ is

$$B^+ = \frac{1}{8}\begin{pmatrix} 3 & 3 & -1 & -1 \\ 3 & -1 & 3 & -1 \\ -1 & 3 & -1 & 3 \\ -1 & -1 & 3 & 3 \end{pmatrix}.$$
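The kernel vector and the pseudo-inverse of the incidence matrix of the 4-cycle can be verified in exact arithmetic. The sketch below, with helper names (`matmul`, `transpose`) chosen for illustration, checks $Bk = 0$ and the four Penrose conditions using rational numbers.

```python
from fractions import Fraction

# Incidence matrix of the groups {1,2}, {1,3}, {2,4}, {3,4} (rows: variables).
B = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]

# Candidate Moore-Penrose pseudo-inverse, stored as exact fractions.
B_pinv = [[Fraction(x, 8) for x in row] for row in
          [[3, 3, -1, -1],
           [3, -1, 3, -1],
           [-1, 3, -1, 3],
           [-1, -1, 3, 3]]]

kernel = [-1, 1, 1, -1]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

B_kernel = [sum(b * k for b, k in zip(row, kernel)) for row in B]
BP = matmul(B, B_pinv)   # projector onto Im(B)
PB = matmul(B_pinv, B)   # projector onto Im(B^T)
```

The four Penrose conditions ($BB^+B = B$, $B^+BB^+ = B^+$, and symmetry of $BB^+$ and $B^+B$) characterize the pseudo-inverse uniquely, so passing them confirms the matrix given above.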



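Since $\zeta_1 + \zeta_4 = \zeta_2 + \zeta_3 = \omega \triangleq \Omega(w)$, the set of solutions is given by

$$\begin{pmatrix} \lambda_{12} \\ \lambda_{13} \\ \lambda_{24} \\ \lambda_{34} \end{pmatrix} = B^+ \begin{pmatrix} \zeta_1 \\ \zeta_2 \\ \omega - \zeta_2 \\ \omega - \zeta_1 \end{pmatrix} + t \begin{pmatrix} -1 \\ 1 \\ 1 \\ -1 \end{pmatrix}, \qquad t \in \mathbb{R},$$

which, after the reparametrization $\delta = 2t + \omega/2$, reads

$$\begin{pmatrix} \lambda_{12} \\ \lambda_{13} \\ \lambda_{24} \\ \lambda_{34} \end{pmatrix} = \frac{1}{2} \begin{pmatrix} \zeta_1 + \zeta_2 - \delta \\ \zeta_1 - \zeta_2 + \delta \\ \zeta_2 - \zeta_1 + \delta \\ 2\omega - \zeta_1 - \zeta_2 - \delta \end{pmatrix}$$

for values of $\delta$ such that $\lambda_g \ge 0$. The latter constraint implies that we necessarily have

$$|\zeta_2 - \zeta_1| \le \delta \le \min(\zeta_1 + \zeta_2,\ 2\omega - \zeta_1 - \zeta_2).$$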

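W.l.o.g., we assume that $\zeta_1 \le \zeta_2 \le \omega - \zeta_2 \le \omega - \zeta_1$. In that case the set of solutions in $\lambda$
is parametrized by $\nu \in [0, 1]$ with

$$\lambda_{12} = \nu\,\zeta_1, \qquad \lambda_{13} = (1-\nu)\,\zeta_1, \qquad \lambda_{24} = \zeta_2 - \nu\,\zeta_1, \qquad \lambda_{34} = \omega - \zeta_2 - (1-\nu)\,\zeta_1.$$

In particular, we see that setting $\nu = 0$ or $\nu = 1$ respectively removes $\{1,2\}$ and $\{1,3\}$
from the group-support of $v$.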
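The one-parameter family of multipliers can be made concrete with a few lines of code. The sketch below is an illustration under the ordering assumption stated above; the function name `cycle4_lambdas`, the dictionary keys and the example vector are assumptions, and the $\zeta_i$ are computed from the closed forms of the previous subsection.

```python
import math

def cycle4_lambdas(w, nu):
    """Multipliers lambda(nu) for the 4-cycle, nu in [0, 1] (illustrative).

    Assumes zeta_1 <= zeta_2 <= omega - zeta_2 <= omega - zeta_1 and a
    non-sparse w; raises ValueError if the ordering does not hold.
    """
    a1, a2, a3, a4 = (abs(x) for x in w)
    omega = math.hypot(a1 + a4, a2 + a3)
    zeta1 = a1 / (a1 + a4) * omega
    zeta2 = a2 / (a2 + a3) * omega
    if not (zeta1 <= zeta2 <= omega - zeta2 <= omega - zeta1):
        raise ValueError("ordering assumption violated; relabel the variables")
    lam = {"12": nu * zeta1,
           "13": (1 - nu) * zeta1,
           "24": zeta2 - nu * zeta1,
           "34": omega - zeta2 - (1 - nu) * zeta1}
    return omega, lam

w = (1.0, 2.0, 3.0, 4.0)                    # satisfies the ordering assumption
omega, lam_0 = cycle4_lambdas(w, 0)         # group {1,2} inactive
_, lam_half = cycle4_lambdas(w, 0.5)        # all four groups active
_, lam_1 = cycle4_lambdas(w, 1)             # group {1,3} inactive
```

All three choices of $\nu$ give non-negative multipliers that sum to the same value $\Omega(w)$, so they correspond to distinct optimal decompositions with different group-supports.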

The case considered here is an example of the situation where the decomposition is not
unique, which is characterised by lemma 48 in the next section.
Appendix D. Uniqueness of the decomposition

In this section we give necessary and sufficient conditions for the decomposition to be unique. As
in lemma 44, we consider $B$ the incidence matrix of the groups defined by $B_{ig} = 1_{\{i \in g\}}$. As
before we denote $\mathcal{G}_1$ the strong group-support, $J_1 = \bigcup_{g \in \mathcal{G}_1} g$ and $J_0 = \mathrm{supp}(w)$. Denote
by $B_{J_0 \mathcal{G}_1}$ the submatrix of $B$ whose rows are indexed by elements of the support of $w$ and
whose columns are indexed by elements of $\mathcal{G}_1$.

Lemma 48 The decomposition is unique if and only if $B_{J_0 \mathcal{G}_1}$ has full row rank.



Proof By lemma 7, the uniqueness of the decomposition is equivalent to the uniqueness
of the solution $\lambda$ to problem (10), which we can rewrite

$$\min_{\lambda}\ \frac{1}{2} \sum_{i} \frac{w_i^2}{\sum_{g \ni i} \lambda_g} + \frac{1}{2} \sum_{g} \lambda_g. \tag{38}$$

Notice that only the terms indexed by $i \in J_0$ and $g \in \mathcal{G}_1$ contribute. Since the objective is
a proper closed convex function with no direction of recession, this optimization problem
admits at least one solution (the proof is the same as for 1). Since the gradient of the
previous objective depends on $\lambda$ only through $\zeta_i = \sum_{g \ni i} \lambda_g$, $i \in J_0$, any other vector $\lambda'$
such that $\zeta_{J_0} = B_{J_0 \mathcal{G}_1} \lambda'_{\mathcal{G}_1}$ is also a solution. It is therefore clear that it is sufficient that
the kernel of $B_{J_0 \mathcal{G}_1}$ is not trivial, i.e., $B_{J_0 \mathcal{G}_1}$ is row rank deficient, to have multiple solutions.
Indeed let $H \in \mathbb{R}^{|J_0| \times K}$ be a basis of the kernel of $B_{J_0 \mathcal{G}_1}$ and consider that, by definition of
$\mathcal{G}_1$, for all $g \in \mathcal{G}_1$, $\lambda_g > 0$. As a consequence, there must exist a neighborhood $\mathcal{U}$ of $0$ in $\mathbb{R}^K$
such that for all $q \in \mathcal{U}$, $\lambda_{\mathcal{G}_1} + Hq$ has positive components. Since $\zeta_{J_0} = B_{J_0 \mathcal{G}_1}(\lambda_{\mathcal{G}_1} + Hq)$,
we have that $\lambda_{\mathcal{G}_1} + Hq$ is another solution of the KKT conditions.

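We now prove that $B_{J_0 \mathcal{G}_1}$ being of full row rank is sufficient to ensure the uniqueness of
the decomposition. Indeed, we show next that when $B_{J_0 \mathcal{G}_1}$ is of full row rank, the Hessian of
the objective of (38), restricted to the non-zero $\lambda_g$, is positive definite, so that the objective
is strictly convex and the optimum is therefore unique. The Hessian is $Q = (Q_{gg'})_{g, g' \in \mathcal{G}_1}$
with

$$Q_{gg'} = \sum_{i \in g \cap g'} \frac{w_i^2}{\big(\sum_{g'' \ni i} \lambda_{g''}\big)^3}, \quad\text{i.e.,}\quad Q = B_{J_0 \mathcal{G}_1}^\top D\, B_{J_0 \mathcal{G}_1} \quad\text{and}\quad D = \mathrm{diag}\Big(\Big(w_i^2 \big(\textstyle\sum_{g \ni i} \lambda_g\big)^{-3}\Big)_{i \in J_0}\Big).$$

Since $D$ is a diagonal matrix with non-zero coefficients, $Q$ is positive definite iff $B_{J_0 \mathcal{G}_1}$ is full row rank,
which concludes the proof. ■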